How output tokens drive the real cost of AI responses

A business owner sitting at a desk reviewing a printed document with a laptop open in front of them
TL;DR

Output tokens, every word an AI model sends back to you, are billed at three to five times the rate of input tokens by the major API providers. Owner-managed businesses using AI for drafting, reporting, or client communications are particularly exposed because those tasks produce short inputs and long replies. Setting explicit length caps in prompts and routing simpler tasks to cheaper model tiers can cut token spend by 30 to 70 per cent without quality loss.

Key takeaways

- Output tokens are billed at three to five times the rate of input tokens by the major AI API providers, a pricing differential that holds across budget, mid-range, and premium model tiers. - Common business tasks such as drafting proposals, client emails, and policy documents are long-output by nature, which means output costs dominate the bill even when prompts are kept short. - Hidden tokens from system instructions, tool definitions, and conversation histories can add 20 to 40 per cent to your actual token count in production use, without appearing explicitly in usage reports. - Setting explicit word or bullet limits in every prompt is the single most practical step to reduce per-request cost, and combining it with model routing can cut token spend by 30 to 70 per cent on like-for-like tasks. - Flat-fee SaaS products, short classification tasks, and embedding-only workflows are not directly affected by output token pricing, so the optimisation effort belongs where you are using APIs directly.

The invoice came through on a Tuesday afternoon. An owner of a professional services firm in Leeds had been using an AI writing tool for six weeks, mostly for drafting client reports and proposal outlines. She had estimated the monthly cost at around £30, based on a rough calculation her developer had done at the start. The actual figure was four times that. The line item said “API usage, 1.2 million tokens.” The word “tokens” appeared nowhere in the original briefing.

The gap between estimate and reality almost always comes down to the same thing. The cost of an AI response is shaped more by what the model sends back than by what you send in.

What is an output token?

When you send a request to an AI model, the system converts your text into small units called tokens. Your request makes up the input tokens. Every word of the model’s response is an output token. Both sides are metered and billed separately by API providers, and the two rates are not equal. Understanding this split is the foundation of understanding your AI costs.

A token is roughly three to four characters of English text: part of a word, a punctuation mark, or a short common word on its own. A 200-word document converts to around 250 to 280 tokens. Your instructions, any documents you paste in, and previous messages in the thread all count as input tokens. The full text of the reply is output. MindStudio’s documentation illustrates this with a worked example: a 500-token prompt plus a 200-token response in GPT-4, at $0.01 per thousand input tokens and $0.03 per thousand output tokens, comes to $0.011 for that single exchange. Scaled across hundreds of daily requests, those fractions accumulate quickly.

Why does the output side cost more than the input?

Generating each output token requires more computation than reading an input token. For every word it produces, the model runs a full forward pass through its parameters. That extra work is reflected directly in the price. Typical API pricing in early 2026 put input tokens at $0.15 to $5.00 per million, and output tokens at $0.60 to $25.00 per million, a three-to-five times differential that holds across budget, mid-range, and premium model tiers.

The practical consequence for owner-managed businesses is straightforward. Drafting a client report, writing a proposal, or summarising a set of meeting notes involves a short input and a long reply. In those workloads, the output side accounts for the majority of the cost on every call, regardless of how concisely the prompt is written. The differential is consistent across providers, built into the economics of how these APIs are structured. A two-word change to your prompt is unlikely to shift the bill significantly. A decision to cap response length at 150 words rather than letting the model run free almost certainly will.

Where do output tokens actually add up for a service firm?

Content creation is where output token costs land hardest for service businesses. A prompt asking for a proposal draft might run to 200 tokens. The resulting draft might run to 800 or 1,000 tokens. Run that pattern across a team producing several documents a day and the monthly total climbs quickly, often to two or four times whatever estimate was put together at the start.

There are also sources that are less obvious. Conversation history is resent in full with every new message in many AI tools, growing with each exchange. System instructions, which configure how the AI behaves, add to every request without appearing in what you type. Tool definitions, which describe functions the model can call, contribute another layer. Practitioner guides and provider documentation suggest these hidden inputs add 20 to 40 per cent to the actual token count in production use, without appearing anywhere in the usage report you see.

The CMA’s work on AI foundation models reinforces the importance of understanding pricing structures before you commit. Token-based billing sounds transparent, but different providers use different tokenisers. The same block of text can produce different token counts in OpenAI’s system, Anthropic’s system, and Google’s system. A headline price per million tokens is therefore not a direct comparison across providers without running a test on your own content.

When does the output token split not apply to your costs?

If your AI tools are priced as a flat monthly subscription rather than on usage, the token economics are absorbed by the vendor and do not appear as a separate cost for you. Short classification tasks, such as labelling a customer query or running a sentiment check on an email, produce very few output tokens, so the output differential has a negligible effect on your bill.

Embedding-only workflows, where you use AI to build a searchable index from a document library, often have no output text at all. The pricing in those cases is input-only. If you have an AI feature that returns a single word or a short label rather than a paragraph, the same logic applies: input tokens will account for most of the cost. The question to ask before worrying about output optimisation is simple: does the model produce substantial amounts of text in this use case? If yes, the output token differential matters. If not, it is an internal issue for the vendor, not a line item on your invoice.

What else do you need to understand to manage token costs?

The most effective starting move is to set explicit output length caps in every production prompt. Instructing the model to reply in under 150 words, or to produce a bullet list capped at six items, directly reduces the token count on every call. Pair this with a request to your vendor or technical contact for a per-workflow breakdown of input and output tokens.

Model selection is the second lever. Every AI provider offers model tiers at significantly different price points, from under a dollar per million tokens to $75 or more. A premium model is appropriate for complex reasoning tasks, but running routine drafting or summarisation through that same tier is expensive by default. Routing simpler tasks to a more affordable model can cut per-request costs substantially without a change in output quality your team would notice. Combining prompt length controls, history summarisation, and model routing can reduce token spend by 30 to 70 per cent on like-for-like tasks, according to practitioners including 10Clouds.

Managing conversation history is the third practical lever. Rather than allowing AI tools to resend full conversation threads with every new message, configure the tool or instruct the model to work from a short summary of earlier exchanges. This applies to any tool that supports multi-turn conversations.

There are also regulatory dimensions worth registering alongside the cost ones. The ICO’s guidance on AI and data protection notes that outputs containing personal data, such as AI-drafted HR notes or client-specific reports, fall under UK GDPR’s data minimisation principle. Keeping outputs to what is genuinely needed serves both your cost controls and your data handling obligations. The NCSC’s guidance on using public generative AI safely makes a related point: managing what the model sends back is part of managing your organisation’s exposure, not just its invoice. For regulated firms, the FCA’s guidance on outsourcing and third-party risk management is also relevant: cost spikes from uncontrolled AI output are an operational risk, not simply a budget inconvenience.

If you want a concrete starting point, ask your technical contact or vendor to pull the average input and output token count for each workflow you run. That single number will tell you more about where your AI spend is going than any amount of headline pricing comparison.

Sources

- ICO (2023). AI and data protection. UK regulator guidance on how prompts and outputs containing personal data fall under UK GDPR, including the data minimisation principle relevant to controlling output length and handling. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/ai-and-data-protection/ - NCSC (2024). Using public generative AI services safely. UK government guidance on managing data exposure and cost when using generative AI services, including the role of output controls in operational risk management. https://www.ncsc.gov.uk/guidance/using-public-generative-ai-safely - FCA (2021). Outsourcing and third-party risk management. Guidance for regulated firms on managing operational risk from third-party services, including cost spikes from uncontrolled AI usage as part of resilience planning. https://www.fca.org.uk/firms/outsourcing-and-third-party-service-providers - CMA (2023). AI foundation models: initial report. Examines competition and pricing transparency in AI markets, relevant to understanding token-based billing structures and the risk of provider lock-in. https://www.gov.uk/government/publications/ai-foundation-models-initial-report - European Commission (2024). The Artificial Intelligence Act: ensuring safe and trustworthy AI. Sets transparency and risk-management obligations informing vendor disclosure standards relevant to model pricing and behaviour. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai - NVIDIA (2025). Rethinking AI TCO: why cost per token is the only metric that matters. Industry analysis arguing that cost per million tokens is the defining economic unit for AI infrastructure, with data on generation-over-generation pricing shifts. https://blogs.nvidia.com/blog/lowest-token-cost-ai-factories/ - MindStudio (2025). What is token-based pricing for AI models? Pricing band data for input and output tokens across model tiers as of January 2026, including worked examples and the finding that prompt phrasing affects output token count. https://www.mindstudio.ai/blog/token-based-pricing/ - 10Clouds (2025). Mastering AI token optimisation: proven strategies to cut AI cost. Practitioner guide on token reduction techniques including prompt length constraints, history summarisation, and model routing, with 30 to 70 per cent savings estimates. https://10clouds.com/blog/a-i/mastering-ai-token-optimization-proven-strategies-to-cut-ai-cost/ - Pivot Point Security (2025). AI tokens and how they impact usage costs, explained. Accessible breakdown of how tokens affect API billing and practical guidance on reducing unnecessary token usage in production systems. https://www.pivotpointsecurity.com/falling-behind-on-cmmc-compliance-heres-how-to-catch-up-fast-2-2-2-2/

Frequently asked questions

What is the difference between input tokens and output tokens?

Input tokens are everything you send to the AI model: your prompt, any pasted documents, and previous messages in the thread. Output tokens are every word of the model's response. API providers typically charge three to five times more for output tokens because generating each one requires the model to run a full calculation, a computational step that reading input tokens does not require.

Why is my AI API bill higher than I expected?

Real-world token bills commonly run two to four times higher than initial estimates for three reasons: output tokens cost more than input tokens, hidden tokens from system instructions, tool definitions, and conversation history can add 20 to 40 per cent to your total, and models produce longer responses than prompted without explicit length caps. Measuring average input and output tokens per workflow is the first step to understanding where the spend is going.

How can I reduce output token costs without losing quality?

Set explicit length caps in every production prompt, for example "reply in under 150 words" or "bullet list, maximum six items". Summarise conversation history rather than resending full threads. Route straightforward tasks to a cheaper model tier. Together, these steps can reduce token spend by 30 to 70 per cent on like-for-like tasks, according to optimisation guidance from practitioners including 10Clouds.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation

Related reading

If any of this sounds familiar, let's talk.

The next step is a conversation. No pitch, no pressure. Just an honest discussion about where you are and whether I can help.

Book a conversation