A 30-staff IT consulting firm I sat with earlier this year was working through three AI vendor quotes on a procurement spreadsheet inherited from a software-licensing exercise. One row was labelled “training cost”. Two vendors had filled it in with headline numbers like “trained on 15 trillion tokens at a reported $100 million build cost”. The third had refused to disclose anything.
The owner spent the best part of an afternoon trying to compare those figures. None was a number she would ever pay. Every penny on every quote was inference cost, and the training row drove a comparison that did not exist.
The right procurement question is what the model costs to run, against the volume of work she expected to put through it.
What is inference cost?
Inference cost is the per-use charge a vendor applies every time an AI model runs and produces an output for you. It is variable, scales with volume, and shows up on your bill as per-token, per-call, per-GPU-second, per-image, or per-second-of-audio rates depending on the modality. Training cost is the one-off price of building the model, paid by the vendor.
The technical separation is simple. Training teaches a model on vast datasets, adjusting billions of parameters until it behaves usefully. That happens once per model, on the vendor’s infrastructure, at a cost reportedly above £75 million for GPT-4 and similar for Gemini Ultra. Inference is what happens every time someone types a question afterwards.
From your side as an owner, only inference cost ever appears on an invoice. OpenAI, Anthropic, Google, AWS, and Microsoft have absorbed the training spend and recover it slowly through per-request rates. Asking which vendor’s training run was cheapest is like asking which restaurant’s kitchen was cheapest to build. You are buying meals.
Why does it matter for your business?
Inference cost is the only AI cost that lands on your bill, and it behaves differently from any other software line you fund. Traditional software pricing is fixed: a flat monthly fee regardless of usage. Inference is variable. If an AI feature gets unexpectedly popular, or a campaign drives a spike in customer queries, the bill rises in lockstep with the volume.
The framing matters for a second reason. Per-token prices have fallen by a factor of 280 in two years on equivalent capability, yet enterprise AI bills have grown roughly 320% over the same period. Usage has grown faster than unit cost has fallen. Agentic workflows, where a model reasons through several steps and tool calls per task, consume 5 to 30 times more tokens per outcome than the simple chatbots of two years ago.
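To make the token-volume point concrete, here is a rough sketch in Python comparing the cost per completed task for a simple single-call chatbot against an agentic workflow. Every figure in it is an illustrative assumption, not a number from any vendor's price list.

```python
# Illustrative only: cost per business outcome for a simple chatbot call
# versus an agentic workflow, at the same per-token price.
# None of these figures comes from a real vendor price list.

PRICE_PER_MILLION_TOKENS = 2.00       # hypothetical blended rate, £ per million tokens

simple_chatbot_tokens = 1_500         # one prompt, one reply
agentic_workflow_tokens = 1_500 * 20  # ~20x, within the 5-30x range quoted above

def cost_per_outcome(tokens: int) -> float:
    """Cost in pounds for one completed task."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(f"Simple chatbot:   £{cost_per_outcome(simple_chatbot_tokens):.4f} per task")
print(f"Agentic workflow: £{cost_per_outcome(agentic_workflow_tokens):.4f} per task")
# Even if the per-token price halved, the agentic workflow would still cost
# ten times more per task than the simple chatbot does at today's price.
```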
The competitive advantage in 2026 goes to the firm consuming the fewest tokens per business outcome, not the firm paying the lowest price per token. A practice that routes simple classification to a cheap model, and pays for the frontier tier only on genuinely complex reasoning, will run a much smaller bill than one sending every request to the most powerful option.
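What model routing looks like in practice can be sketched in a few lines. The model names, rates, and the is_complex() check below are all hypothetical stand-ins; a real router would classify requests properly, but the cost logic is the same.

```python
# A minimal model-routing sketch. The model names, rates, and the
# is_complex() heuristic are illustrative assumptions, not real pricing.

ROUTES = {
    "budget":   {"model": "budget-model",   "gbp_per_m_input": 0.30, "gbp_per_m_output": 1.20},
    "frontier": {"model": "frontier-model", "gbp_per_m_input": 6.00, "gbp_per_m_output": 24.00},
}

def is_complex(task: str) -> bool:
    """Stand-in for a real classifier: long, multi-step requests go to the frontier tier."""
    return len(task.split()) > 200 or "analyse" in task.lower()

def route(task: str) -> dict:
    return ROUTES["frontier"] if is_complex(task) else ROUTES["budget"]

def estimated_cost(task: str, input_tokens: int, output_tokens: int) -> float:
    r = route(task)
    return (input_tokens / 1e6) * r["gbp_per_m_input"] + (output_tokens / 1e6) * r["gbp_per_m_output"]

# A short classification request lands on the budget tier...
print(f"£{estimated_cost('Tag this support email by topic.', 400, 50):.5f}")
# ...so only genuinely complex work pays the frontier rate.
```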
Where will you actually meet it?
Inference cost takes five different shapes depending on the modality. The dominant one for language models is per-token billing, where vendors charge separately for input tokens (what you send the model) and output tokens (what it generates). Output usually costs 3 to 5 times more than input across OpenAI, Anthropic, and Google. Image, audio, and self-hosted compute use different structures again.
For the major language model providers in 2026, per-million-token rates cluster into three tiers. Budget models such as GPT-5 mini, Gemini Flash-Lite, and Claude Haiku run roughly £0.10 to £0.60 per million input tokens. Mid-tier models such as Claude Sonnet 4.6 and Gemini 2.5 Pro sit around £1 to £3 input. Frontier models such as Claude Opus and the GPT-5 reasoning tier run £3 to £15 input, with output three to five times higher. Always check the pricing page on the day you quote.
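To see how the tier choice and the input-output split combine on a real bill, the sketch below estimates a month of usage for a hypothetical workload. The rates are illustrative mid-points of the ranges above, not live prices, and should be replaced with the figures on the vendor's pricing page on the day you quote.

```python
# Rough monthly bill estimate across the three tiers described above.
# Rates are illustrative mid-points in £ per million tokens, not live prices.

TIERS = {
    "budget":   {"input": 0.30, "output": 1.20},
    "mid":      {"input": 2.00, "output": 8.00},
    "frontier": {"input": 9.00, "output": 36.00},
}

# Hypothetical workload: 20,000 requests a month,
# ~1,200 input tokens and ~400 output tokens each.
requests_per_month = 20_000
input_tokens_per_request = 1_200
output_tokens_per_request = 400

for tier, rate in TIERS.items():
    input_m = requests_per_month * input_tokens_per_request / 1_000_000
    output_m = requests_per_month * output_tokens_per_request / 1_000_000
    bill = input_m * rate["input"] + output_m * rate["output"]
    print(f"{tier:>8}: £{bill:,.2f} a month")
# Prints roughly £16.80, £112.00, and £504.00: same workload, thirty-fold spread.
```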
Generative image and video are billed per output unit, typically £0.02 to £0.15 per image and a few pence per second of generated video. Speech and audio use per-minute or per-token rates, with roughly 25 audio tokens equal to one second. Self-hosted open-weight models on Together, Replicate, Modal, or Fireworks bill per GPU-hour, with H100 hire around £4 an hour at current rates.
Two cost levers sit on top of the headline rate and are worth knowing about. The batch API on OpenAI, Anthropic, and AWS Bedrock runs your work asynchronously, usually within 24 hours, at a flat 50% discount on input and output. Prompt caching gives roughly a 90% discount on the unchanging prefix of a long prompt, useful where you reuse a fixed system prompt. Both stack and both are vendor settings, not architecture changes.
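The arithmetic of how the two levers stack is worth seeing once. The sketch below assumes a 50% batch discount and a cached prefix billed at 10% of the input rate; the exact mechanics vary by vendor, so treat it as the shape of the calculation rather than a quote.

```python
# How batch and prompt-caching discounts stack on one request.
# Illustrative rates and discount levels; check your vendor's terms.

GBP_PER_M_INPUT = 2.00
GBP_PER_M_OUTPUT = 8.00
BATCH_DISCOUNT = 0.50    # flat 50% off input and output on the batch API
CACHE_DISCOUNT = 0.90    # cached prefix billed at ~10% of the input rate

cached_prefix_tokens = 5_000   # a long, fixed system prompt reused every call
fresh_input_tokens = 800       # the part that changes per request
output_tokens = 300

def request_cost(batched: bool, cached: bool) -> float:
    cached_rate = GBP_PER_M_INPUT * (1 - CACHE_DISCOUNT) if cached else GBP_PER_M_INPUT
    cost = (
        cached_prefix_tokens / 1e6 * cached_rate
        + fresh_input_tokens / 1e6 * GBP_PER_M_INPUT
        + output_tokens / 1e6 * GBP_PER_M_OUTPUT
    )
    return cost * (1 - BATCH_DISCOUNT) if batched else cost

print(f"Standard:       £{request_cost(False, False):.5f}")
print(f"Cached:         £{request_cost(False, True):.5f}")
print(f"Cached + batch: £{request_cost(True, True):.5f}")
# Roughly £0.01400, £0.00500, £0.00250: the two levers together take
# this particular request to under a fifth of its standard cost.
```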
When to ask about it, when to ignore it
Below £100 a month total spend, ignore inference cost and pick the model that does the job well. The management time you would spend optimising it costs more than the savings. Between £500 and £2,000 a month, it is worth taking seriously: model routing, batch processing for non-urgent work, and prompt caching for stable prompts can together cut the bill by 40 to 60%.
Above £2,000 a month, the API-to-self-hosted inflection point comes into view. Below 10 million tokens a month, cloud APIs are almost always cheaper once you count the engineering overhead of running your own inference servers. Between 10 and 100 million tokens, the calculation depends on whether you need frontier capability, only available on the proprietary APIs, or whether an open-weight model from the Llama, Mistral, or DeepSeek families will do. Above 100 million tokens, self-hosting on rented GPUs usually wins. Few UK SMEs at £1 million to £10 million turnover reach that point.
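The break-even itself is simple arithmetic once you fix your assumptions. The sketch below compares an assumed blended frontier-tier API rate against one always-on hired H100; the throughput figure and the blended rate are both assumptions, and it deliberately ignores the engineering time self-hosting demands.

```python
# Rough API-versus-self-hosted comparison. Every figure is an assumption:
# throughput varies hugely by model and batching, and this ignores the
# engineering time that running your own inference servers costs.

API_GBP_PER_M_TOKENS = 20.0     # assumed blended frontier-tier rate, output premium included
GPU_GBP_PER_HOUR = 4.0          # hired H100, as quoted above
GPU_TOKENS_PER_SECOND = 1_500   # assumed sustained throughput for an open-weight model

def api_cost(tokens_per_month: float) -> float:
    return tokens_per_month / 1_000_000 * API_GBP_PER_M_TOKENS

def self_hosted_cost(tokens_per_month: float) -> float:
    # One always-on GPU is the floor: you pay for it whether it is busy or idle.
    hours_needed = tokens_per_month / GPU_TOKENS_PER_SECOND / 3600
    return max(hours_needed, 24 * 30) * GPU_GBP_PER_HOUR

for tokens in (10e6, 100e6, 500e6):
    print(f"{tokens / 1e6:>5.0f}M tokens/month: "
          f"API £{api_cost(tokens):,.0f} vs self-hosted £{self_hosted_cost(tokens):,.0f}")
# With these assumptions the crossover sits a little above 100 million tokens a month.
```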
One risk is worth pricing into your 2027 budget. Current API rates are partly subsidised by venture capital and hyperscaler cross-funding, and several analysts expect a 30 to 50% normalisation within 12 to 24 months as capital discipline tightens. If your run rate today is £5,000 a month, plan for £7,000 from mid-2027 and you will not be surprised. Get any price commitment from your vendor in writing.
The questions to put to a vendor at quote time are direct. What do you charge per million input and output tokens at each model tier? Is batch available at the standard 50% discount? Is prompt caching supported, and at what minimum prefix size? Can I route between models on the same account? What guarantees on pricing stability are in writing for the next 12 months? Vague answers mean you are not yet making an informed cost decision.
Related concepts
Tokens are the unit AI vendors meter and bill on. At roughly three-quarters of a word in English, they are the counter behind every per-token rate. A separate explainer covers how the count actually works, including why a 1,000-word email is closer to 1,300 tokens.
Input and output tokens are billed separately, with output usually 3 to 5 times more expensive than input. That gap is why a workflow producing long-form generation is structurally more costly than one returning a short JSON classification.
Context window is the maximum number of tokens a model can process in a single request, currently 200,000 on Claude and around one million on Gemini Pro. Inference cost rises with input length, so a long window is a capability lever, not a free upgrade.
Prompt caching is the largest single cost lever for any workload that reuses a long, stable prefix many times in a short window. It runs at roughly 10% of standard input cost on the cached portion and stacks with batch processing.
Rate limits, per-seat versus usage-based pricing, total cost of ownership, and vendor lock-in are the rest of the cost surface and are covered in sibling posts. If your AI bill is rising and you do not know which lever to pull first, model routing followed by prompt caching is almost always the answer.