A 12-staff legal practice I worked with last month was running two AI features in parallel on the same vendor account. Feature one read 50-page client contracts and produced a two-page summary. Feature two drafted routine client correspondence. Headline pricing was identical for both: three dollars per million input tokens and fifteen per million output, on Claude Sonnet 4.6.
The bills were not identical. The contract summariser was eating the bulk of the monthly spend; the email drafter barely registered. The managing partner assumed the summaries were just doing more work. The actual reason was structural, sitting one layer below the per-token rate the procurement team had been comparing.
What are input vs output tokens?
Input tokens are everything you send to the model. Your prompt, the system prompt sitting underneath it, any retrieved documents your application has fetched, and the conversation history if there is one. Output tokens are everything the model sends back. Both are billed, on every major API in 2026, and across every vendor the output rate runs roughly four to six times higher than the input rate on the same model.
The plain-English version is that the bill has two columns and many owners only read one. Anthropic’s published rates show Claude Sonnet 4.6 at $3 per million input tokens and $15 per million output. Claude Haiku 4.5 lists at $1 input and $5 output. Google’s Gemini 2.5 Pro is listed at $1.25 input and $10 output below 200,000 tokens of context, with both rates doubling above that boundary. The ratio holds regardless of which provider you pick.
Why does the gap exist?
The 4-6x premium reflects how the compute actually works, not a vendor margin decision. Input is processed in parallel: the model reads your whole prompt in a single pass. Output is generated sequentially, one token at a time, with each new token requiring a fresh pass over the model’s parameters and a fresh read of the key-value cache that holds everything processed so far. Memory bandwidth, not raw compute, is the bottleneck on the decode side.
The gap is also unlikely to close in the next model generation. New architectures speed up both sides, but the asymmetry is structural. Any plan that assumes output will get cheaper relative to input next year is a plan based on a hope, not on engineering. When you are sizing the economics of an AI feature, the input-to-output ratio of your workload is the number you design around, not the headline per-token rate.
The other consequence is that two features with the same vendor, the same model, and the same total token volume can produce different bills. A contract summariser sending 25,000 tokens of input to get a 2,000-token summary pays roughly two and a half times as much for input as for output. An email drafter sending an 800-token brief to get a 600-token draft skews the other way, with the output side costing nearly four times the input side. Same rate card, two very different cost shapes.
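A back-of-the-envelope check makes the shapes concrete. The sketch below just multiplies the two workloads by the Sonnet rates quoted above; the token counts are the examples from this section, not figures from a real account.

```python
# Back-of-the-envelope cost shapes at $3 / $15 per million tokens (the Sonnet rates above).
INPUT_RATE = 3.00 / 1_000_000    # dollars per input token
OUTPUT_RATE = 15.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input cost, output cost) in dollars for one request."""
    return input_tokens * INPUT_RATE, output_tokens * OUTPUT_RATE

summariser = request_cost(25_000, 2_000)   # contract summary
drafter = request_cost(800, 600)           # routine email

print(round(summariser[0], 4), round(summariser[1], 4))  # 0.075 0.03  -> input ~2.5x output
print(round(drafter[0], 4), round(drafter[1], 4))        # 0.0024 0.009 -> output ~3.75x input
```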
Where will you actually meet it?
You will meet input and output rates in three places: the vendor’s public pricing page, your own usage console, and your bill. Anthropic, OpenAI, and Google all list input and output as separate line items on their pricing tables, and the ratio is visible at a glance once you know to look for it. The console and the bill are where the surprises live.
Inside the console, the breakdown is finer than the pricing page suggests. Anthropic’s Claude API console reports uncached input, cached input, cache-creation, and output as four separate columns. OpenAI’s usage dashboard splits prompt tokens and completion tokens. AWS Bedrock’s cost and usage report splits input, output, cache-read, and cache-write into four line items, precisely because the per-token economics differ across all four.
The bill is where the surprise usually lands, after the first month. The line item that catches people is almost always the same one: a customer support bot with an 8,000-token system prompt defining tone, policy, and few-shot examples pays that 8,000-token cost on every user query, even when the query is fifty tokens and the response is a hundred. The system prompt is input, so it is the cheaper side, but it recurs at full volume on every request.
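The arithmetic behind that line item is a one-minute check. In the sketch below, the 8,000-token prompt and 50-token query come from the example above; the monthly query volume is an assumed figure for illustration, not data from any account.

```python
# Monthly cost of a fixed 8,000-token system prompt resent on every query,
# at $3 per million input tokens. The 20,000 queries/month is an assumed figure.
INPUT_RATE = 3.00 / 1_000_000

system_prompt_tokens = 8_000
user_query_tokens = 50
queries_per_month = 20_000   # assumption for illustration

recurring_prompt_cost = system_prompt_tokens * queries_per_month * INPUT_RATE
query_cost = user_query_tokens * queries_per_month * INPUT_RATE

print(round(recurring_prompt_cost, 2))  # 480.0 -- the system prompt alone
print(round(query_cost, 2))             # 3.0   -- the actual user questions
```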
The dashboard is where the optimisation signal lives. The pricing page tells you which vendor is cheapest in theory. The dashboard tells you which feature is expensive in your account, and which lever will move the bill.
When should you care, and when can you ignore it?
Care immediately if you are building a feature where the input side recurs at scale. Customer support bots, RAG-powered internal search, contract review, document Q&A, and agentic systems all sit in this category. The levers that move the bill are retrieval precision and prompt caching; Anthropic and OpenAI both offer 50% to 90% discounts on cached input for repeated long contexts. A support bot that pays £5 a month on cached system prompts instead of £100 on uncached ones is a materially different business case.
Care also if your feature is output-heavy: long-form drafting, creative generation, reasoning models with hidden chain-of-thought, or any LLM workflow that lets the model run long. The levers there are a max_tokens cap, a tier-down to a smaller model where the task allows, and a structured-output schema that constrains the response shape.
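Capping output is usually a one-line change at the call site. A minimal sketch with Anthropic's Python SDK follows; the model ID is a placeholder and the cap of 700 tokens is an arbitrary illustration, not a recommended value.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",   # placeholder model ID -- use your account's actual ID
    max_tokens=700,              # hard cap on output tokens, the expensive side of the bill
    messages=[
        {"role": "user", "content": "Draft a short reply confirming Thursday's meeting."},
    ],
)

print(response.content[0].text)
print(response.usage.input_tokens, response.usage.output_tokens)  # the counts you are billed for
```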
Ignore it if your monthly token spend is under roughly £100 and the workload is occasional rather than continuous. The optimisation effort costs more than the savings at that scale. Ignore it also if you are negotiating volume discounts at over £1,000 a month, where the per-token rate becomes less important than the negotiated agreement, the priority routing, and the support tier.
There is one trap that catches careful people anyway. Reasoning models charge for their internal “thinking” tokens at the output rate, even though those tokens are invisible to the user. One published cross-vendor benchmark found Gemini 3 Flash, listed at 78% cheaper than GPT-5.2 per token, actually costing 22% more in production because it generated three times as many output tokens, much of them hidden reasoning. Benchmark on your real workload before switching to a cheaper-listed model.
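If you want to sanity-check a switch before committing, the calculation is simple enough to script. The rates and token counts below are made up for illustration; substitute the observed usage numbers from your own dashboard.

```python
# Why a cheaper list price can cost more per request: hidden reasoning tokens
# are billed at the output rate. All rates and token counts here are illustrative.

def request_cost(input_tokens, output_tokens, in_rate, out_rate):
    """Cost in dollars for one request; rates are dollars per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Model A: higher list price, terse answers.
cost_a = request_cost(2_000, 400, in_rate=5.00, out_rate=20.00)

# Model B: cheaper list price, but 3x the output tokens once hidden reasoning is counted.
cost_b = request_cost(2_000, 1_200, in_rate=1.00, out_rate=15.00)

print(f"{cost_a:.4f}  {cost_b:.4f}")  # 0.0180  0.0200 -- the "cheaper" model costs more
```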
Related concepts you will meet next
A token is the smallest unit of text the model sees, roughly four characters or three-quarters of an English word. The input/output asymmetry sits on top of the token concept; if a vendor talks about “millions of tokens”, they almost always mean millions of input or millions of output, not a blended figure.
A context window is the total amount of input the model can hold in working memory in a single request. Larger context windows let you feed in more retrieved documents or longer conversation histories, but they push input volume up at full input rates, and on some vendors they cross a tier boundary that doubles the per-token rate. Sizing the context window to the workload is a different decision from choosing the model.
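Where a tier boundary applies, the effective input rate is a step function of prompt size. The sketch below encodes the Gemini 2.5 Pro boundary quoted earlier and assumes the higher rate applies to the whole prompt once it crosses the line, which is how the published table reads; the request sizes are invented.

```python
# Input cost with a long-context tier boundary: $1.25 per million input tokens
# up to 200,000 tokens, $2.50 above (the Gemini 2.5 Pro rates quoted earlier).

def input_cost(input_tokens: int) -> float:
    rate = 1.25 if input_tokens <= 200_000 else 2.50   # dollars per million tokens
    return input_tokens * rate / 1_000_000

print(input_cost(180_000))  # 0.225
print(input_cost(220_000))  # 0.55 -- crossing the boundary more than doubles the bill
```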
Prompt caching is the lever for input-heavy workloads. When the same long context (a system prompt, a fixed knowledge base extract) is sent across many queries, vendors offer a discounted “cached” input rate, often around 10% of the standard input rate after the first call. For a support bot with a stable system prompt and high query volume, caching can cut effective input cost by 80-90%.
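On Anthropic's API the discount is opt-in per content block. A minimal sketch follows, assuming the current shape of prompt caching in the Messages API; the model ID is a placeholder, and the minimum cacheable prompt length and cache lifetime are worth confirming on the pricing page before you count the savings.

```python
import anthropic

client = anthropic.Anthropic()

# Stand-in for the long, stable prefix (policy, tone, few-shot examples); it must
# exceed the model's minimum cacheable length for the discount to apply.
SYSTEM_PROMPT = "...your 8,000-token policy, tone, and few-shot examples..."

response = client.messages.create(
    model="claude-sonnet-4-6",   # placeholder model ID
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Mark the stable prefix as cacheable; later calls that reuse it are
            # billed at the discounted cache-read rate instead of full input.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is my order?"}],
)

# The usage object separates the cached and uncached sides of the bill.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```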
Model tier selection is the lever for output-heavy workloads. On the rates above, Claude Haiku is roughly a third of the per-token price of Claude Sonnet, and Claude Opus runs roughly five times Sonnet's price. Routing simple classification and extraction queries to the smallest model that handles them, and reserving the larger model for genuine reasoning work, often saves more than every other lever combined.
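Routing does not need a framework; a lookup keyed on task type is often enough to start. The sketch below is a hypothetical router, not any vendor's API; the model IDs are placeholders and the task labels are whatever your application already attaches to each request.

```python
# Hypothetical tier router: send cheap, well-defined tasks to the smallest model
# and keep the larger model for open-ended reasoning. Model IDs are placeholders.

TIER_BY_TASK = {
    "classify_ticket": "claude-haiku-4-5",
    "extract_fields": "claude-haiku-4-5",
    "summarise_contract": "claude-sonnet-4-6",
    "draft_advice_letter": "claude-sonnet-4-6",
}

def pick_model(task: str) -> str:
    # Default to the cheaper tier; escalate only for tasks you have
    # verified the small model cannot handle.
    return TIER_BY_TASK.get(task, "claude-haiku-4-5")

print(pick_model("classify_ticket"))      # claude-haiku-4-5
print(pick_model("summarise_contract"))   # claude-sonnet-4-6
```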
The point of all of this is to give you a procurement question your vendor cannot dodge. Ask them what your input-to-output ratio looks like on their platform for the workload you are building. If they cannot tell you, they do not know their own unit economics, and the bill that arrives next month will tell you so.



