What are input vs output tokens? Why it matters for your business

TL;DR

Input tokens are everything you send the model, output tokens are everything it sends back, and across every major vendor in 2026 the output rate runs roughly four to six times higher than the input rate. Two AI features with the same headline pricing can produce wildly different bills. Knowing whether your workload is input-heavy or output-heavy tells you which lever to pull.

Key takeaways

- Input tokens (your prompt, system prompt, retrieved context, conversation history) are processed in parallel and are cheap.
- Output tokens (what the model generates) are produced one at a time and cost roughly 4-6x more per token.
- The premium reflects compute physics, not vendor margin, so the gap is unlikely to close with new model generations.
- Input-heavy workloads (RAG, contract review, document Q&A) are moved by retrieval precision and prompt caching.
- Output-heavy workloads (long-form drafting, reasoning models, agent loops) are moved by output length controls and model tier choice.

A 12-staff legal practice I worked with last month was running two AI features in parallel on the same vendor account. Feature one read 50-page client contracts and produced a two-page summary. Feature two drafted routine client correspondence. Headline pricing was identical for both, three dollars per million input tokens and fifteen per million output, on Claude Sonnet 4.6.

The bills were not identical. The contract summariser was eating the bulk of the monthly spend; the email drafter barely registered. The managing partner assumed the summaries were just doing more work. The actual reason was structural, sitting one layer below the per-token rate the procurement team had been comparing.

What are input vs output tokens?

Input tokens are everything you send to the model. Your prompt, the system prompt sitting underneath it, any retrieved documents your application has fetched, and the conversation history if there is one. Output tokens are everything the model sends back. Both are billed, on every major API in 2026, and across every vendor the output rate runs roughly four to six times higher than the input rate on the same model.

The plain-English version is that the bill has two columns and many owners only read one. Anthropic’s published rates show Claude Sonnet 4.6 at three dollars per million input and fifteen per million output. Claude Haiku 4.5 lists at one dollar input and five output. Google’s Gemini 2.5 Pro is listed at $1.25 input and $10 output below 200,000 tokens of context, with both rates doubling above that boundary. The ratio holds regardless of which provider you pick.
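Those published figures can be turned into a one-function cost model. A minimal sketch in Python, with the 2026 rates above hardcoded as illustrative values (always verify against the live pricing pages before relying on them):

```python
# Per-million-token rates quoted above (illustrative 2026 figures;
# check the vendor's live pricing page before using these in anger).
RATES = {
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "claude-haiku-4.5":  {"input": 1.00, "output": 5.00},
    "gemini-2.5-pro":    {"input": 1.25, "output": 10.00},  # sub-200K-context tier
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request: each side is billed at its own rate."""
    r = RATES[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# 10,000 tokens in and 1,000 tokens out on Sonnet:
# $0.030 of input plus $0.015 of output, $0.045 in total.
```

Writing it down makes the asymmetry visible: at $15 per million against $3, halving output length moves the bill five times as far as trimming the same number of input tokens.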

Why does the gap exist?

The 4-6x premium reflects how the compute actually works, not a vendor margin decision. Input is processed in parallel: the model reads your whole prompt in a single pass. Output is generated sequentially, one token at a time, with each new token requiring a fresh pass over the model’s parameters and a fresh read from a memory cache. Memory bandwidth, not raw compute, is the bottleneck on the decode side.

The gap is also unlikely to close in the next model generation. New architectures speed up both sides, but the asymmetry is structural. Any plan that assumes output will get cheaper relative to input next year is a plan based on a hope, not on engineering. When you are sizing the economics of an AI feature, the input-to-output ratio of your workload is the number you design around, not the headline per-token rate.

The other consequence is that two features with the same vendor, the same model, and similar total token volumes can produce very different bills. A contract summariser sending 25,000 tokens of input to get a 2,000-token summary spends roughly two and a half times as much on input as on output. An email drafter sending an 800-token brief to get a 600-token draft spends nearly four times as much on output as on input. Same rates, two opposite cost shapes.
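Running that arithmetic explicitly at Sonnet-class rates ($3 per million in, $15 per million out; the token counts are the illustrative figures from the paragraph above):

```python
IN_RATE, OUT_RATE = 3.00, 15.00  # dollars per million tokens (Sonnet-class)

def cost_split(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_cost, output_cost) in dollars for one request."""
    return input_tokens * IN_RATE / 1e6, output_tokens * OUT_RATE / 1e6

summariser = cost_split(25_000, 2_000)  # ($0.0750, $0.0300): input dominates
drafter    = cost_split(800, 600)       # ($0.0024, $0.0090): output dominates
```

The summariser's bill is driven by the input column despite input being the cheap side; the drafter's is driven by the output column despite the request being small. The per-token rate tells you neither of these things on its own.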

Where will you actually meet it?

You will meet input and output rates in three places: the vendor’s public pricing page, your own usage console, and your bill. Anthropic, OpenAI, and Google all list input and output as separate line items on their pricing tables, and the ratio is visible at a glance once you know to look for it. The console and the bill are where the surprises live.

Inside the console, the breakdown is finer than the pricing page suggests. Anthropic’s Claude API console reports uncached input, cached input, cache-creation, and output as four separate columns. OpenAI’s usage dashboard splits prompt tokens and completion tokens. AWS Bedrock’s cost and usage report splits input, output, cache-read, and cache-write into four line items, precisely because the per-token economics differ across all four.

The bill is where the surprise usually lands, after the first month. The line item that catches people is almost always the same one: a customer support bot with an 8,000-token system prompt defining tone, policy, and few-shot examples pays that 8,000-token cost on every user query, even when the query is fifty tokens and the response is a hundred. The system prompt is input, so it is the cheaper side, but it recurs at full volume on every request.
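The recurring-system-prompt effect is easy to size. A sketch using the 8,000-token prompt, 50-token query, and 100-token response from the example above, with 30,000 queries a month as a hypothetical volume:

```python
IN_RATE, OUT_RATE = 3.00, 15.00  # dollars per million tokens (Sonnet-class)

def monthly_bot_cost(system_tokens: int, query_tokens: int,
                     response_tokens: int, queries_per_month: int) -> float:
    """The system prompt is re-sent, and re-billed, on every single query."""
    input_tokens = (system_tokens + query_tokens) * queries_per_month
    output_tokens = response_tokens * queries_per_month
    return (input_tokens * IN_RATE + output_tokens * OUT_RATE) / 1e6

# 8,000-token system prompt, 50-token query, 100-token reply, 30,000 queries:
# roughly $724.50 of input against $45 of output. The "cheap" side dominates.
```

This is the structural pattern behind the legal practice's contract summariser at the top of the article: volume on the input side, not expensive output, driving the bill.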

The dashboard is where the optimisation signal lives. The pricing page tells you which vendor is cheapest in theory. The dashboard tells you which feature is expensive in your account, and which lever will move the bill.

When should you care, and when can you ignore it?

Care immediately if you are building a feature where the input side recurs at scale. Customer support bots, RAG-powered internal search, contract review, document Q&A, and agentic systems all sit in this category. The levers that move the bill are retrieval precision and prompt caching: Anthropic and OpenAI both offer discounts of 50% to 90% on cached input for repeated long contexts. A support bot that pays £5 a month on cached system prompts instead of £100 on uncached ones is a materially different business case.

Care also if your feature is output-heavy: long-form drafting, creative generation, reasoning models with hidden chain-of-thought, or any LLM workflow that lets the model run long. The levers there are a max_tokens cap, a tier-down to a smaller model where the task allows, and a structured-output schema that constrains the response shape.

Ignore it if your monthly token spend is under roughly £100 and the workload is occasional rather than continuous. The optimisation effort costs more than the savings at that scale. Ignore it also if you are negotiating volume discounts at over £1,000 a month, where the per-token rate becomes less important than the negotiated agreement, the priority routing, and the support tier.

There is one trap that catches careful people anyway. Reasoning models charge for their internal "thinking" tokens at the output rate, even though those tokens are invisible to the user. One published cross-vendor benchmark found that Gemini 3 Flash, listed at 78% cheaper than GPT-5.2 per token, actually cost 22% more in production because it generated three times the output tokens, much of it hidden reasoning. Benchmark on your real workload before switching to a cheaper-listed model.

A token is the smallest unit of text the model sees, roughly four characters or three-quarters of an English word. The input/output asymmetry sits on top of the token concept; if a vendor talks about “millions of tokens”, they almost always mean millions of input or millions of output, not a blended figure.
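The four-characters rule of thumb gives a serviceable back-of-envelope estimator. A heuristic only; real tokenisers vary by model and by language:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    Real tokenisers (tiktoken, a vendor's count-tokens endpoint) will differ."""
    return max(1, round(len(text) / 4))

# A 400-character paragraph is roughly 100 tokens.
```

Good enough for sizing a workload's input-to-output ratio; not good enough for reconciling an invoice.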

A context window is the total amount of input the model can hold in working memory in a single request. Larger context windows let you feed in more retrieved documents or longer conversation histories, but they push input volume up at full input rates, and on some vendors they cross a tier boundary that doubles the per-token rate. Sizing the context window to the workload is a different decision from choosing the model.

Prompt caching is the lever for input-heavy workloads. When the same long context (a system prompt, a fixed knowledge base extract) is sent across many queries, vendors offer a discounted “cached” input rate, often around 10% of the standard input rate after the first call. For a support bot with a stable system prompt and high query volume, caching can cut effective input cost by 80-90%.
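The caching arithmetic, sketched under two assumptions: a cached rate of 10% of the standard input rate, as quoted above, and the one-off cache-write premium some vendors charge on the first call ignored for simplicity:

```python
IN_RATE = 3.00            # dollars per million input tokens (Sonnet-class)
CACHED_MULTIPLIER = 0.10  # cached reads at ~10% of the standard input rate

def monthly_prompt_cost(system_tokens: int, queries: int, cached: bool) -> float:
    """Cost of re-sending a fixed system prompt on every query for a month."""
    rate = IN_RATE * (CACHED_MULTIPLIER if cached else 1.0)
    return system_tokens * queries * rate / 1e6

uncached = monthly_prompt_cost(8_000, 30_000, cached=False)  # $720.00
cached   = monthly_prompt_cost(8_000, 30_000, cached=True)   # $72.00
```

A 90% saving on the recurring context, with no change to the feature itself, which is why caching is the first lever to check on any input-heavy workload.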

Model tier selection is the lever for output-heavy workloads. At the rates listed above, Claude Haiku is roughly three times cheaper per output token than Claude Sonnet, and Sonnet is roughly five times cheaper than Opus. Routing simple classification and extraction queries to the smallest model that handles them, and reserving the larger model for genuine reasoning work, often saves more than every other lever combined.
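The routing economics, sketched with the output rates quoted earlier ($5 per million on Haiku-class, $15 on Sonnet-class); the 80/20 split is a hypothetical mix, not a measured one:

```python
HAIKU_OUT, SONNET_OUT = 5.00, 15.00  # dollars per million output tokens

def blended_output_rate(share_to_small: float) -> float:
    """Effective output rate when a fraction of queries go to the smaller model."""
    return share_to_small * HAIKU_OUT + (1 - share_to_small) * SONNET_OUT

# Routing 80% of queries to the small model: a blended $7/M against $15/M flat,
# cutting the output side by just over half before touching any other lever.
```

The saving scales with how much of your traffic is genuinely simple, which is something the usage dashboard, not the pricing page, will tell you.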

The point of all of this is to give you a procurement question your vendor cannot dodge. Ask them what your input-to-output ratio looks like on their platform for the workload you are building. If they cannot tell you, they do not know their own unit economics, and the bill that arrives next month will tell you so.

Sources

- Anthropic (2026). Claude API pricing page. Definitive source for Claude family input/output rates and prompt-caching tiers cited throughout. https://www.anthropic.com/api/pricing
- OpenAI (2026). API pricing. Authoritative source for GPT-5 family input/output split and reasoning-model premium tiers. https://openai.com/api/pricing/
- Google (2026). Gemini API pricing. Source for the 200K context boundary that doubles input and output rates on Gemini 2.5 Pro. https://ai.google.dev/gemini-api/docs/pricing
- Anthropic (2026). Usage and Cost API documentation. Programmatic access to per-token-type usage including uncached input, cached input, cache creation, and output. https://platform.claude.com/docs/en/manage-claude/usage-cost-api
- Amazon Web Services (2025). Understanding cost and usage report data for Bedrock. Shows how input, output, cache-read, and cache-write tokens appear as four separate line items in CUR data. https://docs.aws.amazon.com/bedrock/latest/userguide/cost-mgmt-understanding-cur-data.html
- OpenAI (2024). What is the difference between prompt tokens and completion tokens? Plain-English vendor explainer for the input/output billing distinction. https://help.openai.com/en/articles/7127987-what-is-the-difference-between-prompt-tokens-and-completion-tokens
- Quimby, M. (2026). The hidden cost of cheap AI: why budget reasoning models actually cost 6x more. Cross-vendor benchmark showing 21.8% of model-pair comparisons reverse on real workloads once hidden reasoning tokens are counted. https://dev.to/max_quimby/the-hidden-cost-of-cheap-ai-why-budget-reasoning-models-actually-cost-6x-more-3e0
- Vantage (2025). The real cost of agentic coding. Empirical breakdown of input/output ratios in long agentic sessions, where context accumulation pushes input to roughly 25:1 over output. https://www.vantage.sh/blog/agentic-coding-costs
- Microsoft Azure AI (2025). Context-aware RAG with Azure AI Search to cut token costs and boost accuracy. Vendor case study showing semantic chunking reduces input tokens by 80-85% versus naive full-document retrieval. https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/context-aware-rag-system-with-azure-ai-search-to-cut-token-costs-and-boost-accur/4456810
- Datasør (2025). Reducing output tokens in large language model inference through smarter prompting. Empirical study finding output token reductions of 40-90% from temperature and prompt-conciseness tuning on certain task types. https://www.datasor.no/reducing-output-tokens-in-large-language-model-inference-through-smarter-prompting/

Frequently asked questions

My vendor only quotes one per-token price. Should I push for two?

Yes. Every major API provider in 2026 publishes input and output rates separately on their public pricing pages, so a single headline rate is either an averaged figure or a flat resale margin. Ask for the underlying split, the cache discount tiers, and the batch-API rate. If the vendor cannot answer, they probably do not know their own unit economics, and that is a procurement signal in itself.

Why does output cost more if it is fewer tokens?

Input is read by the model in a single parallel pass, so a thousand input tokens is roughly one heavy compute step. Output is generated one token at a time, with each new token requiring a fresh pass over the model's parameters and a fresh read from a memory cache. Memory bandwidth, not raw compute, is the bottleneck on the decode side. The 4-6x premium is what that physics costs in dollars and pence.

How do I tell if my workload is input-heavy or output-heavy?

Look at one typical request. Add up your system prompt, any retrieved context, and the user's question to get the input total. Then look at the response length you actually need. If the input is more than three times the output, you are input-heavy and should optimise retrieval and caching. If the output is anywhere close to the input or larger, you are output-heavy and should set max_tokens caps and benchmark a tier-down model.
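That rule of thumb can be written down directly. The 3:1 threshold is the one from the answer above; the 1.5:1 boundary for "anywhere close" is an assumed cut-off added for illustration:

```python
def workload_shape(input_tokens: int, output_tokens: int) -> str:
    """Classify a typical request by its input-to-output ratio.
    The 3:1 threshold follows the rule of thumb above; the 1.5:1
    boundary for "anywhere close" is an assumed cut-off."""
    ratio = input_tokens / max(1, output_tokens)
    if ratio > 3:
        return "input-heavy"   # optimise retrieval precision and caching
    if ratio < 1.5:
        return "output-heavy"  # cap max_tokens, benchmark a tier-down model
    return "balanced"

# A RAG request, 20,000 tokens of context in and 500 out: "input-heavy".
# A drafting request, 800 in and 600 out: "output-heavy".
```

Run it over a week of real requests from your usage logs rather than a single hand-picked example; the distribution matters more than any one query.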

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
