A 27-staff financial advisory firm flips on the new reasoning model in their compliance review pipeline. Quality goes up. The model spots edge cases the previous setup missed, including a misclassified disclosure that would have triggered an FCA query. Quality going up is what they wanted. Then the bill arrives. It is roughly seven times the previous month’s, and the line item that explains the spike is invisible on the front page of the dashboard.
A click into per-call detail surfaces the cause. The model is writing out an internal chain of reasoning before every answer, and those reasoning tokens are billed at output rates. For genuinely hard cases the working pays for itself. For the routine 80 to 85 per cent of reviews, a direct-answer model would have produced an identical conclusion at a fraction of the cost. The owner needs a rule for when the reasoning premium earns its keep, before the next monthly invoice doubles again.
What is chain-of-thought reasoning?
Chain-of-thought is the model writing out its intermediate working before delivering a final answer. The 2022 prompt trick was appending “let’s think step by step” and watching arithmetic-reasoning accuracy on MultiArith jump from 17.7 per cent to 78.7 per cent. By 2024 to 2026 the trick had become architecture in OpenAI’s o-series, Claude extended thinking, Gemini Deep Think, and DeepSeek-R1.
Practically, the model now produces two things on each call. A visible answer for the user, and a reasoning trace that captures the steps it took to get there. On Claude the reasoning trace is configurable through a thinking budget in tokens. On OpenAI it is exposed as a reasoning effort dial with minimal, low, medium, high, and extreme tiers. On Gemini it is the Deep Think mode. The trick became a paid feature, with its own line item on the bill.
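To make the difference in surface concrete, here is a sketch of how the same knob appears in two vendors’ request payloads. The field shapes follow Anthropic’s and OpenAI’s public docs at the time of writing; the model ids are placeholders and exact names may differ by API version.

```python
# Claude: reasoning is a token budget, billed at output rates.
anthropic_request = {
    "model": "claude-sonnet-4-20250514",   # placeholder model id
    "max_tokens": 4096,
    "thinking": {"type": "enabled", "budget_tokens": 8000},
    "messages": [{"role": "user", "content": "Review this disclosure..."}],
}

# OpenAI: reasoning is a discrete effort tier, not a token budget.
openai_request = {
    "model": "o1",                          # placeholder model id
    "reasoning": {"effort": "medium"},
    "input": "Review this disclosure...",
}
```

The asymmetry matters for budgeting: a token budget caps the worst case per call, while an effort tier leaves the per-call token count to the model.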
Why the cost-quality tradeoff matters for your business
It matters because the accuracy lift and the cost shape do not move at the same rate. Digital Applied’s 2026 analysis found reasoning effort dials lift accuracy 8 to 22 points on hard reasoning benchmarks, while inflating per-call cost 4 to 17 times and latency 5 to 60 times. The lift is genuine on multi-step contract review and financial sanity checks. On routine classification, summarisation, and FAQ retrieval, it is rounding error.
The cost shape is what trips owners up. OpenAI o1 prices at roughly $15 per million input tokens and $60 per million output tokens, and reasoning tokens count as output. A single hard query can generate ten to a hundred times more tokens than its standard equivalent because the model is writing out an internal monologue. This is why OpenAI’s 2024 inference spend reached around 2.3 billion dollars, fifteen times its training costs. The bill arrives a month after the quality lift, and that delay is the trap.
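The per-call shape is easy to see in a back-of-envelope sketch using the o1 rates above. The token counts below are illustrative, not measured:

```python
def call_cost(input_tokens, visible_output, reasoning_tokens,
              in_rate=15.0, out_rate=60.0):
    """Per-call cost in dollars; rates are per million tokens.
    Reasoning tokens are billed at the output rate."""
    billed_output = visible_output + reasoning_tokens
    return (input_tokens * in_rate + billed_output * out_rate) / 1_000_000

# A routine review: 2,000 input tokens, a 400-token answer, no reasoning.
direct = call_cost(2_000, 400, 0)            # $0.054
# The same review with a 20,000-token reasoning trace.
reasoned = call_cost(2_000, 400, 20_000)     # $1.254
```

Same answer, roughly 23 times the cost, and the difference never shows up as a separate line on most dashboards.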
The decision metric that matters is cost per correct answer on your actual workload. A 5 per cent accuracy lift is worth a 5x cost on a contract sign-off where one wrong answer costs you a client. The same 5 per cent lift on a ticket-classification call is pure waste. Published benchmarks like AIME and GPQA are a starting point. They are not a substitute for measuring on the work you are actually paying for.
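The metric itself is one line of arithmetic. The accuracy and cost figures below are illustrative, assuming a 1,000-case sample where the reasoning model buys a 5-point accuracy lift at roughly 23 times the spend:

```python
def cost_per_correct(total_cost, n_cases, accuracy):
    """Dollars spent per correct answer on a sample workload."""
    return total_cost / (n_cases * accuracy)

standard  = cost_per_correct(total_cost=54.0,   n_cases=1_000, accuracy=0.92)
reasoning = cost_per_correct(total_cost=1254.0, n_cases=1_000, accuracy=0.97)
# standard  ≈ $0.059 per correct answer
# reasoning ≈ $1.293 per correct answer
```

On this sample the premium only pays if the 50 extra correct answers are each worth more than the roughly $24 they cost, which is true of a contract sign-off and false of a ticket classifier.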
Where you will meet it
You will meet chain-of-thought on every frontier vendor’s pricing page in 2026, and the surface language differs. OpenAI exposes a reasoning_effort parameter with five tiers. Anthropic exposes a thinking budget in tokens. Google offers Deep Think on Gemini 2.5 Pro and 3.0 Pro. DeepSeek-R1 reasons by default, which is why its per-token rates look cheap while the volume per query runs hot.
You will meet it in the system prompt you write to instruct the model. The same prompt that says “you are a compliance reviewer” can also say “use extended thinking for items flagged by the rule engine and respond directly otherwise”. That single line, applied per workflow, is usually a bigger cost lever than picking the cheapest vendor.
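In code, that one-line policy is a per-call router. A minimal sketch, assuming a hypothetical `flagged_by_rule_engine` signal from an upstream rule engine and placeholder model names:

```python
def pick_model(case):
    """Route one case: pay the reasoning premium only when flagged."""
    if case.get("flagged_by_rule_engine"):
        # Hard case: enable extended thinking with a capped budget.
        return {"model": "reasoning-model", "thinking_budget": 8000}
    # Routine case: cheapest plausible direct-answer model.
    return {"model": "standard-model", "thinking_budget": 0}

pick_model({"flagged_by_rule_engine": True})    # reasoning-model
pick_model({"flagged_by_rule_engine": False})   # standard-model
```

If 80 to 85 per cent of cases take the cheap branch, the blended cost lands far closer to the standard model’s than to the reasoning model’s.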
You will meet it in the conversation history of any multi-turn agent. Reasoning tokens compound across turns because the model regenerates working at each step. A 30-step support agent on a reasoning model can quietly burn through hundreds of thousands of output tokens for a single ticket. The same effect amplifies in function-calling loops, where the model reasons about which tool to call, calls it, then reasons about the result, then calls another. Agentic loops with reasoning enabled by default are where bills go vertical.
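The compounding is easy to estimate. The per-step counts below are illustrative, using the $60-per-million output rate quoted earlier:

```python
STEPS = 30                    # turns in the agent loop
REASONING_PER_STEP = 8_000    # tokens of working regenerated each step
ANSWER_PER_STEP = 300         # visible output tokens per step
OUT_RATE = 60.0               # dollars per million output tokens

total_output = STEPS * (REASONING_PER_STEP + ANSWER_PER_STEP)   # 249,000 tokens
ticket_cost = total_output * OUT_RATE / 1_000_000               # ≈ $14.94 per ticket
```

At a few hundred tickets a day, that is a five-figure monthly line item produced almost entirely by working the user never sees.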
When the premium earns its keep, when it is overkill
Enable reasoning when the cost of a wrong answer materially exceeds the cost of reasoning tokens. Three concrete patterns. Multi-step contract and compliance analysis on agreements with cross-references and indemnity clauses. Complex customer-support diagnostics where the model must form, test, and eliminate hypotheses. Financial sanity checks where reconciling variance requires holding several numbers at once. Concord reported 98 per cent contract-review accuracy using extended reasoning.
Disable it for high-frequency, low-stakes work. One practitioner field report from 2025 describes teams routing customer-support classification, FAQ retrieval, and simple code scaffolding through reasoning-class models, then discovering the outputs were no better than standard models, just slower and more expensive. The default should be the cheapest plausible model. Reasoning is enabled per workflow, not as a global toggle.
The decision rule that holds up. Profile your specific workload on a standard model and a reasoning model. Measure cost per correct answer on a sample of real cases. Use reasoning where it earns the premium. Leave it off where it does not. Set the policy at the workflow level so a single procurement decision is not paying twice for the same output across a hundred different workloads.
Related concepts
Tokens are the atomic unit on the bill, and reasoning tokens count as output tokens at output prices. The mapping from words to tokens is not one-to-one, and code, JSON, and structured reasoning traces inflate the count by 35 to 50 per cent relative to English prose. The full mechanics live in the what is a token post.
Reasoning effort dials are the fine-grained version of the toggle. Minimal, low, medium, high, and extreme on OpenAI; thinking budget in tokens on Claude. The 2026 finding is that on competition mathematics, high effort earns its keep because the answer is binary and verifiable. On code refactoring, medium peaks. On analytic reasoning, the curve plateaus mid-band. The right tier is task-specific.
Inference cost is where reasoning shows up in the operating bill. The cost curve is non-linear, the dashboard is often opaque, and the line items that explain the biggest spikes are the ones vendors do not surface prominently. The standalone explainer for owners is at what is inference cost.
Prompt caching can take the sting out of repeated reasoning prefixes. If your queries share a long system prompt, knowledge-base prefix, or tool definitions, caching the prefix at a 90 per cent discount on subsequent reads can recover a meaningful fraction of the reasoning premium. The mechanics live in what is prompt caching.
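The caching arithmetic is worth sketching. This assumes a simplified pricing model in which the first call pays the full input rate for the prefix and every later call reads it at a 90 per cent discount; real vendors add a cache-write premium and expiry rules, so treat the numbers as illustrative:

```python
def monthly_input_cost(calls, prefix_tokens, query_tokens,
                       in_rate=15.0, cached=False):
    """Input-side cost in dollars; rates are per million tokens."""
    if cached:
        # First call pays full rate for the prefix; later reads pay 10%.
        prefix_cost = prefix_tokens * in_rate * (1 + (calls - 1) * 0.1)
    else:
        prefix_cost = calls * prefix_tokens * in_rate
    return (prefix_cost + calls * query_tokens * in_rate) / 1_000_000

# 10,000 calls a month sharing an 8,000-token system prompt.
uncached   = monthly_input_cost(10_000, 8_000, 500)               # $1,275.00
with_cache = monthly_input_cost(10_000, 8_000, 500, cached=True)  # ≈ $195.11
```

An 85 per cent cut on the input side will not erase the reasoning premium, which lives on the output side, but it meaningfully narrows the gap for workloads with long shared prefixes.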
The honest framing for any owner sitting opposite a reasoning-model pitch is this. The benchmark numbers in the deck are real, and they are also the easy part. The hard part is naming the workflows where the premium earns its keep, the workflows where it is pure waste, and the dashboard discipline to keep an eye on the line item that does not show up on the front page.



