What is chain-of-thought reasoning? When the premium earns its keep

TL;DR

Chain-of-thought reasoning makes an AI write out its intermediate steps before delivering a final answer. In 2024 to 2026 it stopped being a prompting trick and became a structural feature in OpenAI's o-series, Claude extended thinking, Gemini Deep Think, and DeepSeek-R1. The accuracy lift is genuine on hard, multi-step work: contract review, financial sanity checks, complex troubleshooting. On routine work it is overkill: slower and four to seventeen times more expensive for no quality gain.

Key takeaways

- Chain-of-thought is the model writing out its reasoning steps before the final answer. Originally a prompt trick, it is now a paid feature in frontier models with separate token billing.
- The accuracy lift is task-specific. Reasoning effort dials lift accuracy 8 to 22 points on hard reasoning benchmarks while inflating cost 4 to 17x and latency 5 to 60x.
- Reasoning tokens are billed at output rates and often hidden on the dashboard. A single complex query can generate ten to a hundred times more tokens than its standard equivalent.
- The decision rule is per-workflow, not company-wide. Enable for high-stakes, low-frequency work like contract sign-off and compliance edge cases. Disable for high-frequency, low-stakes work like classification and routing.
- Cost-per-correct-answer on your actual workload is the right metric. Published benchmarks are a starting point, not a substitute for measuring on the task you are paying for.

A 27-staff financial advisory firm flips on the new reasoning model in their compliance review pipeline. Quality goes up. The model spots edge cases the previous setup missed, including a misclassified disclosure that would have triggered an FCA query. Quality going up is what they wanted. Then the bill arrives. It is roughly seven times the previous month’s, and the line item that explains the spike is invisible on the front page of the dashboard.

A click into per-call detail surfaces the cause. The model is writing out an internal chain of reasoning before every answer, and those reasoning tokens are billed at output rates. For genuinely hard cases the working pays for itself. For the routine 80 to 85 per cent of reviews, a direct-answer model would have produced an identical conclusion at a fraction of the cost. The owner needs a rule for when the reasoning premium earns its keep, before the next monthly invoice climbs again.

What is chain-of-thought reasoning?

Chain-of-thought is the model writing out its intermediate working before delivering a final answer. The 2022 prompt trick was appending “let’s think step by step” and watching mathematical reasoning accuracy on GSM8K jump from 17.7 per cent to 78.7 per cent. By 2024 to 2026 the trick had become architecture in OpenAI’s o-series, Claude extended thinking, Gemini Deep Think, and DeepSeek-R1.

Practically, the model now produces two things on each call. A visible answer for the user, and a reasoning trace that captures the steps it took to get there. On Claude the reasoning trace is configurable through a thinking budget in tokens. On OpenAI it is exposed as a reasoning effort dial with minimal, low, medium, high, and extreme tiers. On Gemini it is the Deep Think mode. The trick became a paid feature, with its own line item on the bill.
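To make that split concrete, here is a minimal sketch of one call through OpenAI's Python SDK. The model name and prompt are placeholders; the usage fields are the ones OpenAI documents for reasoning models, which report the trace's size without returning its text.

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="o3-mini",            # placeholder: any reasoning-class model
    reasoning_effort="medium",  # the dial discussed above
    messages=[{"role": "user", "content": "Reconcile these two invoice totals: ..."}],
)

# The visible answer, the part the user sees.
print(resp.choices[0].message.content)

# The reasoning trace is billed as output but not returned as text;
# its size is reported in the usage block.
details = resp.usage.completion_tokens_details
print("reasoning tokens:", details.reasoning_tokens)
print("visible + reasoning output tokens:", resp.usage.completion_tokens)
```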

Why the cost-quality tradeoff matters for your business

It matters because the accuracy lift and the cost shape do not move at the same rate. Digital Applied’s 2026 analysis found reasoning effort dials lift accuracy 8 to 22 points on hard reasoning benchmarks, while inflating per-call cost 4 to 17 times and latency 5 to 60 times. The lift is genuine on multi-step contract review and financial sanity checks. On routine classification, summarisation, and FAQ retrieval, it is rounding error.

The cost shape is what trips owners up. OpenAI o1 prices at roughly $15 per million input tokens and $60 per million output tokens, and reasoning tokens count as output. A single hard query can generate ten to a hundred times more tokens than its standard equivalent because the model is writing out an internal monologue. OpenAI’s 2024 inference spend reached around 2.3 billion dollars, fifteen times their training costs, for this reason. The bill arrives a month after the quality lift, and that delay is the trap.
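A back-of-envelope sketch of that shape, using the o1-class rates quoted above; the token counts are assumptions for illustration, not measurements:

```python
# Illustrative per-call cost at $15/M input and $60/M output (o1-class rates).
INPUT_RATE = 15 / 1_000_000   # dollars per input token
OUTPUT_RATE = 60 / 1_000_000  # dollars per output token

def call_cost(input_tokens, answer_tokens, reasoning_tokens=0):
    # Reasoning tokens are billed at the output rate, on top of the visible answer.
    return input_tokens * INPUT_RATE + (answer_tokens + reasoning_tokens) * OUTPUT_RATE

# Assumed token counts for one hard compliance query.
standard = call_cost(input_tokens=2_000, answer_tokens=400)
reasoning = call_cost(input_tokens=2_000, answer_tokens=400, reasoning_tokens=20_000)

print(f"standard call:  ${standard:.3f}")   # ~$0.054
print(f"reasoning call: ${reasoning:.3f}")  # ~$1.254, roughly 23x the standard call
```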

The decision metric that matters is cost per correct answer on your actual workload. A 5 per cent accuracy lift is worth a 5x cost increase on a contract sign-off where one wrong answer costs you a client. The same 5 per cent lift on a ticket-classification call is pure waste. Published benchmarks like AIME and GPQA are a starting point. They are not a substitute for measuring on the work you are actually paying for.

Where you will meet it

You will meet chain-of-thought on every frontier vendor’s pricing page in 2026, and the surface language differs. OpenAI exposes a reasoning_effort parameter with five tiers. Anthropic exposes a thinking budget in tokens. Google offers Deep Think on Gemini 2.5 Pro and 3.0 Pro. DeepSeek-R1 reasons by default, which is why its per-token rates look cheap while the volume per query runs hot.

You will meet it in the system prompt you write to instruct the model. The same prompt that says “you are a compliance reviewer” can also say “use extended thinking for items flagged by the rule engine and respond directly otherwise”. That single line, applied per workflow, is usually a bigger cost lever than picking the cheapest vendor.
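Here is a minimal sketch of that per-workflow routing rule in code, using Anthropic's documented thinking parameter. The rule-engine flag, model name, and budget figure are assumptions to replace with your own.

```python
import anthropic

client = anthropic.Anthropic()

def review(item: str, flagged_by_rule_engine: bool):
    # Default: cheap, direct answers for the routine 80 to 85 per cent.
    kwargs = {"max_tokens": 1_000}
    if flagged_by_rule_engine:
        # Hypothetical rule: only flagged items pay the reasoning premium.
        # Anthropic requires max_tokens to exceed the thinking budget.
        kwargs["max_tokens"] = 12_000
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 8_000}  # assumed budget

    return client.messages.create(
        model="claude-sonnet-4-5",  # placeholder: use your deployed model
        system="You are a compliance reviewer.",
        messages=[{"role": "user", "content": item}],
        **kwargs,
    )
```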

You will meet it in the conversation history of any multi-turn agent. Reasoning tokens compound across turns because the model regenerates working at each step. A 30-step support agent on a reasoning model can quietly burn through hundreds of thousands of output tokens for a single ticket. The same effect amplifies in function-calling loops, where the model reasons about which tool to call, calls it, then reasons about the result, then calls another. Agentic loops with reasoning enabled by default are where bills go vertical.
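A rough sketch of the compounding arithmetic, where every number is an assumption to replace with figures from your own logs:

```python
# Illustrative: a 30-step agent loop where each step regenerates its working.
OUTPUT_RATE = 60 / 1_000_000   # dollars per output token, o1-class rate
steps = 30
reasoning_per_step = 6_000     # assumed average reasoning tokens per step
answer_per_step = 300          # assumed visible output per step

total_output = steps * (reasoning_per_step + answer_per_step)
print(f"output tokens per ticket: {total_output:,}")                 # 189,000
print(f"output cost per ticket:   ${total_output * OUTPUT_RATE:.2f}")  # ~$11.34
```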

When the premium earns its keep, and when it is overkill

Enable reasoning when the cost of a wrong answer materially exceeds the cost of reasoning tokens. Three concrete patterns. Multi-step contract and compliance analysis on agreements with cross-references and indemnity clauses. Complex customer-support diagnostics where the model must form, test, and eliminate hypotheses. Financial sanity checks where reconciling variance requires holding several numbers at once. Concord reported 98 per cent contract-review accuracy using extended reasoning.

Disable it for high-frequency, low-stakes work. One practitioner field report from 2025 describes teams routing customer-support classification, FAQ retrieval, and simple code scaffolding through reasoning-class models, then discovering the outputs were no better than standard models, just slower and more expensive. The default should be the cheapest plausible model. Reasoning is enabled per workflow, not as a global toggle.

The decision rule that holds up. Profile your specific workload on a standard model and a reasoning model. Measure cost per correct answer on a sample of real cases. Use reasoning where it earns the premium. Leave it off where it does not. Set the policy at the workflow level so a single procurement decision is not paying twice for the same output across a hundred different workloads.
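A minimal sketch of that measurement, assuming you have a scored sample of real cases; the accuracy and cost figures below are hypothetical:

```python
def cost_per_correct(results):
    """results: list of (was_correct: bool, dollars_spent: float), one per call."""
    correct = sum(1 for ok, _ in results if ok)
    spend = sum(cost for _, cost in results)
    return spend / correct if correct else float("inf")

# Hypothetical scores from the same 200-case sample run through both models.
standard_model = cost_per_correct([(True, 0.05)] * 168 + [(False, 0.05)] * 32)
reasoning_model = cost_per_correct([(True, 0.80)] * 188 + [(False, 0.80)] * 12)

print(f"standard:  ${standard_model:.3f} per correct answer")   # ~$0.060 at 84% accuracy
print(f"reasoning: ${reasoning_model:.3f} per correct answer")  # ~$0.851 at 94% accuracy
```

In this made-up sample the reasoning model is ten points more accurate and still fourteen times more expensive per correct answer, which is exactly the trade the workflow-level policy has to price.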

Tokens are the atomic unit on the bill, and reasoning tokens count as output tokens at output prices. The mapping from words to tokens is not one-to-one, and code, JSON, and structured reasoning traces inflate the count by 35 to 50 per cent relative to English prose. The full mechanics live in the what is a token post.

Reasoning effort dials are the fine-grained version of the toggle. Minimal, low, medium, high, and extreme on OpenAI; thinking budget in tokens on Claude. The 2026 finding is that on competition mathematics, high effort earns its keep because the answer is binary and verifiable. On code refactoring, medium peaks. On analytic reasoning, the curve plateaus mid-band. The right tier is task-specific.
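One way to operationalise that finding is to sweep the dial on a sample of real cases and take the cheapest tier that clears your accuracy floor. A sketch, where evaluate_sample is a stand-in for your own evaluation harness:

```python
# Hypothetical sweep: pick the cheapest effort tier that meets an accuracy floor.
TIERS = ["minimal", "low", "medium", "high", "extreme"]  # ordered cheapest first
ACCURACY_FLOOR = 0.92  # assumed threshold; set per workflow

def pick_tier(evaluate_sample):
    """evaluate_sample(tier) -> (accuracy, cost_per_call) measured on real cases."""
    best = None
    for tier in TIERS:
        accuracy, cost = evaluate_sample(tier)
        best = (tier, accuracy, cost)
        if accuracy >= ACCURACY_FLOOR:
            break  # cheapest tier clearing the floor wins
    return best
```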

Inference cost is where reasoning shows up in the operating bill. The cost curve is non-linear, the dashboard is often opaque, and the line items that explain the biggest spikes are the ones vendors do not surface prominently. The standalone explainer for owners is at what is inference cost.

Prompt caching can take the sting out of repeated reasoning prefixes. If your queries share a long system prompt, knowledge-base prefix, or tool definitions, caching the prefix at a 90 per cent discount on subsequent reads can recover a meaningful fraction of the reasoning premium. The mechanics live in what is prompt caching.
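A minimal sketch using Anthropic's documented cache_control marker; the policy text stands in for whatever long shared prefix your workflow carries, and the discount applies to cache reads, not the first write.

```python
import anthropic

client = anthropic.Anthropic()

LONG_SHARED_PREFIX = "You are a compliance reviewer. <several thousand tokens of policy text>"

resp = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder
    max_tokens=1_000,
    system=[
        {
            "type": "text",
            "text": LONG_SHARED_PREFIX,
            # Marks the prefix for caching; subsequent reads bill at the discount.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Review invoice 1042 for disclosure gaps."}],
)

# Usage reports how much of the prefix was written to or served from cache.
print(resp.usage.cache_creation_input_tokens, resp.usage.cache_read_input_tokens)
```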

The honest framing for any owner sitting opposite a reasoning-model pitch is this. The benchmark numbers in the deck are real, and they are also the easy part. The hard part is naming the workflows where the premium earns its keep, the workflows where it is pure waste, and the dashboard discipline to keep an eye on the line item that does not show up on the front page.

Sources

OpenAI (2024). Learning to reason with LLMs. Introduction of the o-series and the inference-time reasoning shift. https://openai.com/index/learning-to-reason-with-llms/

Anthropic (2026). Claude extended thinking documentation. Thinking budget, billing of internal reasoning tokens, and use cases. https://platform.claude.com/docs/en/build-with-claude/extended-thinking

Google DeepMind (2025). Accelerating mathematical and scientific discovery with Gemini Deep Think. Deep Think positioning and parallel reasoning paths. https://deepmind.google/blog/accelerating-mathematical-and-scientific-discovery-with-gemini-deep-think/

Kojima et al. (2022). Large language models are zero-shot reasoners. The original "let's think step by step" paper showing the 17.7% to 78.7% accuracy lift on GSM8K. https://arxiv.org/abs/2205.11916

Wei et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. The foundational chain-of-thought paper. https://arxiv.org/abs/2201.11903

Digital Applied (2026). Reasoning effort, cost vs quality benchmarks. The 8 to 22 point accuracy lift, 4 to 17x cost inflation, and 5 to 60x latency findings across reasoning effort tiers. https://www.digitalapplied.com/blog/reasoning-effort-cost-vs-quality-benchmarks-2026

Concord (2025). AI contract analysis reaches a critical accuracy milestone. The 98% accuracy claim and 92 minutes to 26 seconds per-document figure for legal review. https://www.concord.app/blog/ai-contract-analysis-reaches-critical-accuracy-milestone

DeepSeek (2026). API pricing. DeepSeek-R1 input and output token rates and reasoning model behaviour. https://api-docs.deepseek.com/quick_start/pricing

Task Concierge (2025). Reasoning models are impressive, also overkill for 90 per cent of what you are building. Practitioner field report on routing routine tasks through reasoning models by mistake. https://dev.to/taskconcierge/reasoning-models-are-impressive-theyre-also-overkill-for-90-of-what-youre-building-12h4

AWS (2025). Real-world reasoning: how Amazon Nova handles complex customer support scenarios. Multi-turn diagnostic case study comparing standard and reasoning-enabled models. https://aws.amazon.com/blogs/machine-learning/real-world-reasoning-how-amazon-nova-lite-2-0-handles-complex-customer-support-scenarios/

Frequently asked questions

Is chain-of-thought the same as the new "reasoning" models from OpenAI and Anthropic?

It is the same idea, productised. The 2022 prompt trick was appending "let's think step by step" and watching the model write out its working. OpenAI's o1 and o3, Claude extended thinking, Gemini Deep Think, and DeepSeek-R1 are all trained to spend inference compute on extended reasoning chains by default. The mechanism is the same, the price is now itemised, and the dial is exposed as reasoning effort or thinking budget.

How do I work out whether reasoning is worth the cost on a specific workflow?

Run the same task on a standard model and a reasoning model. Score both for accuracy on a sample of real cases, log token spend per call, and divide cost by correct answers. If the cheaper model produces an acceptable answer, ship it. If the reasoning model only earns its premium on the hard 10 to 20 per cent of cases, route those alone through reasoning and leave the rest on the standard tier. The metric to track is cost per correct answer, not accuracy in isolation.

Why did my AI bill spike after enabling the new reasoning model?

Reasoning tokens are billed at output rates and many vendor dashboards do not surface them prominently. A single complex query can produce ten to a hundred times more tokens than its standard equivalent because the model is writing out an internal chain of working before answering. OpenAI's 2024 inference spend reached around 2.3 billion dollars, fifteen times their training cost, for this reason. Click into per-call token detail on the dashboard and you will usually find the hidden line item.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
