AI inference vs training: why your AI bill behaves the way it does

TL;DR

Training is the one-time, capital-style cost of teaching a model, almost always absorbed by vendors. Inference is the recurring, operational cost of running the trained model on every query. For UK SMEs adopting AI through vendor APIs in 2026, roughly 90% of spend is inference and 10% is everything else combined. The cost-control levers that actually move the bill are all inference levers: prompt caching, batch processing, and model routing, not fine-tuning.

Key takeaways

- Training is paid once, by the vendor, before you ever interact with the model. Inference is paid every time you run a query, and never stops.
- For UK SMEs adopting AI via vendor APIs, roughly 90% of AI spend is inference and 10% is everything else, sometimes 95% in mature deployments.
- Fine-tuning is rarely a cost-control answer. It only repays when the task is narrow, stable, and high-volume, after prompt engineering, caching, and routing have already been optimised.
- Three inference levers dominate the bill: prompt caching (up to 90% off cached reads), batch processing (50% off non-real-time work), and model routing (30 to 50% off through tiered selection).
- The API to self-hosted inflection point sits between 500,000 and 2 million monthly requests. Below it, APIs win. Above it, a dedicated GPU server at £2,000 to £5,000 a month usually wins.

A customer-services director at a 40-staff UK marketing agency opened her OpenAI invoice last quarter and stared at a number that had tripled. “I thought we hired a consultant to fine-tune our model,” she said to her team. “Why hasn’t that fixed the cost?”

Nobody had told her the bill almost entirely reflected inference, the cost of running queries, not training. She had paid £8,000 for a fine-tune to solve a recurring inference problem fine-tuning was never going to fix. The agency was processing roughly 10,000 daily support queries through GPT-5.4 Mini, the bill sat near £518 a month and growing, and the consultant had quoted training as the answer to a question the invoice never asked.

The directors I sit with in 2026 have plenty of AI vocabulary. They need the one distinction that decides whether the budget holds. Read training and inference apart on the invoice and the bill stops being a mystery.

What is the difference between training and inference?

Training is the upfront work of teaching a model on examples, almost always absorbed by vendors before you ever see a price. Inference is the recurring cost of running that trained model on every query. Training is capital-style spend, paid once. Inference is operational spend, paid on every customer reply, every drafted proposal, every document the system reads. For UK SMEs on vendor APIs, the split is roughly 90% inference, 10% everything else.

The numbers behind training are why it never lands on your invoice. OpenAI’s GPT-4 training run is reported between £78 million and £100 million. Gemini Ultra is estimated at around £191 million. Vendors absorbed those costs before any SME sent a query, and recover them slowly through per-token rates spread across millions of customers. You pay only for the replies the model generates after it was trained.

Every vendor pricing schedule in 2026 is an inference schedule. OpenAI ranges from £0.20 per million input tokens on GPT-5.4 Nano to £30.00 on GPT-5.4 Pro, output four to six times higher. Anthropic Claude Haiku sits at £1.00 input and £5.00 output, Sonnet 4.6 at £3.00 and £15.00, Opus 4.6 at £5.00 and £25.00. Google Gemini runs £0.10 to £4.00. AWS Bedrock follows the same per-token shape. No SME-facing training column appears anywhere.

When is training the right answer for an SME?

Almost never as the model itself, sometimes as fine-tuning when the task is narrow, stable, and high-volume. Frontier model training is excluded by cost, with a floor in the tens of millions. Fine-tuning is the one tier of training-style work an SME can plausibly fund, at £5,000 to £30,000 in data labelling plus £2,000 to £10,000 in compute. It repays in a thin set of conditions.

Those conditions are specific. The task has to be narrow, the same shape of input producing the same shape of output. Stable, with criteria that don’t change every few months. High-volume, typically 50,000 or more queries a month on the same task. Classification, extraction, and well-defined drafting jobs at scale are the canonical examples. A LoRA fine-tune on a 7B open-weight model can land at £500 to £5,000 and match a larger paid model.

For typical UK service businesses the conditions rarely line up. Service work is diverse. One day the AI is routing customer questions, the next drafting proposals, the next reviewing contracts. A UK consulting firm fine-tuned a model on previous proposals at £3,000. The result was slower, sometimes produced firm-specific jargon that was technically wrong, and needed monthly retraining. Six months later cumulative damage was around £2,500 against baseline, on work prompt engineering would have handled for £500.

When is inference the right answer? (Almost always.)

Inference is the right answer for nearly every UK SME AI deployment in 2026, because the levers that actually move the bill are all inference levers. Three of them dominate: prompt caching, at up to 90% off cached reads on Anthropic and 50% on OpenAI; batch processing, at 50% off real-time pricing on OpenAI, Google, and Anthropic for workloads that tolerate 12 to 24 hour latency; and model routing across budget, mid, and premium tiers.

The arithmetic on caching alone is striking. A customer support system with a 50,000-token product manual and 1,000 daily queries costs around £20 a day on standard input pricing. Turn caching on at a 99% hit rate with a 90% cached-read discount and the same workload runs at roughly £2 a day, an 89% drop. Batch processing applies to anything that does not need a real-time answer: document pipelines, overnight classification, periodic reports. It halves the cost with no rework.
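The caching arithmetic can be sketched in a few lines. All figures here are illustrative assumptions, not quoted vendor rates: a £0.40 per million input-token price (implied by the £20-a-day example) and a 90% cached-read discount of the kind Anthropic advertises. Substitute your own vendor's numbers.

```python
# Sketch of prompt-caching economics. Prices and discount rates
# are illustrative assumptions, not quoted vendor rates.

def daily_input_cost(queries, prefix_tokens, price_per_m,
                     hit_rate=0.0, cached_discount=0.90):
    """Daily input-token cost for a shared cached prefix.

    Cache hits pay (1 - cached_discount) of the normal rate;
    misses pay full price.
    """
    tokens_m = queries * prefix_tokens / 1_000_000
    effective = (1 - hit_rate) + hit_rate * (1 - cached_discount)
    return tokens_m * price_per_m * effective

uncached = daily_input_cost(1_000, 50_000, 0.40)
cached = daily_input_cost(1_000, 50_000, 0.40, hit_rate=0.99)
print(round(uncached, 2), round(cached, 2))  # 20.0 2.18
```

With these illustrative inputs the cached workload lands near £2 a day; the exact saving depends entirely on your vendor's cached-read rate and your real hit rate.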

Model routing is the lever many SMEs underuse. Token pricing varies by roughly 150 times across the spectrum. Sending every query to GPT-5.4 Pro pays premium rates for tasks a budget model solves at 80% quality for 5 to 10% of the cost. The working pattern: classify by complexity, route 70 to 80% to budget, 15 to 25% to mid, reserve 5 to 10% for premium. One startup documented a fall from £12,000 a month to £1,440 after stacking the three.
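The tiered-routing pattern above can be sketched as a simple dispatcher. The tier names, per-million prices, and the word-count heuristic below are all placeholder assumptions; a production router would typically classify with a cheap model rather than a length check.

```python
# Illustrative three-tier router. Model names, prices, and the
# complexity heuristic are assumptions for the sketch.

TIERS = {
    "budget":  {"model": "small-model", "price_per_m": 0.20},
    "mid":     {"model": "mid-model",   "price_per_m": 3.00},
    "premium": {"model": "large-model", "price_per_m": 30.00},
}

def classify(prompt: str) -> str:
    """Toy complexity heuristic: route by prompt length.
    Real routers usually use a small classifier model instead."""
    words = len(prompt.split())
    if words < 50:
        return "budget"
    if words < 300:
        return "mid"
    return "premium"

def route(prompt: str) -> dict:
    tier = classify(prompt)
    return {"tier": tier, **TIERS[tier]}

choice = route("Summarise this customer email in one line.")
print(choice["tier"])  # budget
```

The design point is the split, not the heuristic: if 70 to 80% of traffic resolves at the budget rate, the blended cost per query collapses even though the premium tier stays available.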

What does it cost to get this wrong?

The cost of confusing the two is often five to ten times the original AI spend. An SME approves a £5,000 to £8,000 fine-tune as a one-off, expects the cost to plateau, and watches the inference bill compound month after month. A West Midlands manufacturer approved an £8,000 fine-tune for computer vision, then found the system was processing 50,000 images a day at £4,000 a month.

Recovery is rarely free. By the time the bill arrives, the architecture is committed. The Midlands firm did claw it back. Batch processing on overnight runs halved the rate. Prompt caching on historical lookups cut another 90% off that slice. Routing simpler product variants to a cheaper model brought spend from £4,000 to £1,200 inside three months. The £8,000 fine-tune paid for itself in three weeks once the inference levers were doing the work.
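The stacked savings in the story above can be approximated by applying each lever's discount to the share of the workload it touches. The slice shares below are assumptions chosen for illustration, not figures from the case.

```python
# Illustrative lever stacking. Workload shares are assumptions.

def apply_levers(monthly_cost, levers):
    """Apply (share_of_workload, discount) levers in sequence.
    Each lever cuts `discount` off its share of the remaining bill."""
    for share, discount in levers:
        monthly_cost -= monthly_cost * share * discount
    return monthly_cost

# Assumed shares: batching 70% of work at 50% off, caching 40%
# at 90% off, routing 50% of queries to a model half the price.
result = apply_levers(4_000, [(0.70, 0.50), (0.40, 0.90), (0.50, 0.50)])
print(round(result))  # 1248
```

With those assumed shares the bill lands around £1,250, the same order as the £1,200 the manufacturer reached; the point is that the levers compound rather than merely add.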

The deeper failure is vocabulary. The owner is buying inference and being told they are buying training. A consultant who quotes fine-tuning as a cost-control answer to a recurring inference problem is either confused about the cost shape or commercially incentivised not to explain it. Once the owner names which side of the split they are on, questions get sharper and deflections get harder.

What to ask any AI vendor before you sign

Five questions reveal a vendor’s cost structure inside an hour, and they are worth asking before the contract is signed. Start with the billing model. Per-token scales with query volume. Per-call bills a fixed amount per conversation. Per-seat charges by user, which encourages vendors to push hiring rather than automation. Per-resolution charges only when the AI solves the problem end-to-end without escalation.

The next three cover the inference levers. Does the platform support prompt caching, and at what discount? Anthropic’s 90% on cached reads versus OpenAI’s 50% is a real difference at volume. Does it support batch processing, and at what discount? The standard is 50%. If your volume scales by 10 times in year one, does the cost model change? A vendor offering attractive on-demand pricing today may push you onto provisioned throughput once you scale. Get the answer in writing.

The fifth question separates the serious vendors from the rest. Can the platform separately bill or report training-related and inference-related costs, so the owner can see which component drives the spend? Vendors who can do that cleanly are signalling commercial maturity. The decision sequence on your side is simple. Start with an off-the-shelf model and a clear prompt. If cost is unacceptable, layer caching and batching. If still unacceptable, layer routing. Only then should fine-tuning or self-hosting return to the table.

If you cannot tell which line on your AI invoice is training and which is inference, book a conversation and we can read the bill together.

Sources

Finout (2026). The new economics of AI, balancing training costs and inference spend. The canonical CapEx-versus-OpEx framing for the training and inference distinction. https://www.finout.io/blog/the-new-economics-of-ai-balancing-training-costs-and-inference-spend

CloudZero (2026). OpenAI API pricing breakdown covering the £0.20 to £30.00 per million input token range across the GPT-5.4 family, with worked SaaS support workload examples. https://www.cloudzero.com/blog/openai-pricing/

Finout (2026). Anthropic API pricing covering Haiku, Sonnet 4.6, and Opus 4.6 per-token rates plus the 90% prompt caching discount and 50% batch discount. https://www.finout.io/blog/anthropic-api-pricing

Finout (2026). Gemini pricing in 2026, covering the £0.10 to £4.00 per million token range across Gemini tiers and the batch mode discount. https://www.finout.io/blog/gemini-pricing-in-2026

AWS (2026). Bedrock pricing page covering on-demand and provisioned throughput pricing across Anthropic, Meta, and Mistral models with the 50% batch discount. https://aws.amazon.com/bedrock/pricing/

AISuperior (2025). Cost of training an LLM from scratch, with the £78 to £100 million GPT-4 and £191 million Gemini Ultra training cost benchmarks drawn from the Stanford AI Index. https://aisuperior.com/cost-of-training-llm-from-scratch/

Noqta (2026). AI API cost optimisation case study showing inference cost reduction from £12,000 a month to £1,440 a month, an 88% drop, via prompt caching, model routing, and an AI gateway. https://noqta.tn/en/blog/ai-api-cost-optimization-prompt-caching-model-routing-2026

TechAhead (2026). Inference cost explosion, the canonical reference for the 80 to 90% inference share of AI spend and the 5 to 25 times agentic flow cost multiplier. https://www.techaheadcorp.com/blog/inference-cost-explosion/

PE Collective (2026). RAG versus fine-tuning cost comparison, RAG at £350 to £2,850 a month versus fine-tuning at £2,400 upfront plus £800 to £3,000 a month in hosting and retraining. https://pecollective.com/blog/rag-vs-fine-tuning-cost/

Pegotec (2026). Self-hosted versus API breakeven analysis placing the inflection point between 500,000 and 2 million monthly requests, with dedicated GPU server cost at £2,000 to £5,000 a month. https://pegotec.net/self-hosted-vs-api-when-to-run-your-own-ai-models/

Frequently asked questions

If training is so expensive, why isn't it on my invoice?

Vendors absorb training cost and recover it slowly through per-token inference rates spread across millions of customers. OpenAI's GPT-4 training run is reported between £78 million and £100 million, Gemini Ultra around £191 million. None of that ever appears on an SME invoice. You are paying only for the cost of running the trained model on your specific queries, which is why 90% of your AI spend lands in the inference column.

Should I fine-tune to reduce my inference bill?

Almost never as a first move. Fine-tuning costs £5,000 to £30,000 in data labelling plus £2,000 to £10,000 in compute, and only repays when the task is narrow, stable, and high-volume. For typical service businesses the task evolves before the fine-tune pays back. Implement prompt caching, batch processing, and model routing first. If those three together still leave the bill unacceptable, then evaluate fine-tuning with data rather than assumptions.

At what volume does it become cheaper to self-host than use an API?

The breakeven sits between 500,000 and 2 million monthly requests, depending on model size and latency tolerance. A dedicated GPU server runs £2,000 to £5,000 a month and handles roughly 1 to 2 million inference requests in that band. Below the breakeven, vendor APIs are almost always cheaper once you count the engineering overhead. Many UK SMEs reach the inflection point 12 to 24 months after launching a successful AI feature.
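The breakeven claim reduces to a one-line calculation. The blended API cost per request below is an assumption for illustration; plug in your own rate from real invoices.

```python
# Breakeven sketch: fixed-cost GPU server vs pay-per-request API.
# The £0.004 per-request API cost is an assumed blended rate.

def breakeven_requests(monthly_server_cost, api_cost_per_request):
    """Monthly request volume at which a fixed-cost server
    matches pay-per-request API spend."""
    return monthly_server_cost / api_cost_per_request

# e.g. a £3,500/month GPU server vs £0.004 per API request:
print(round(breakeven_requests(3_500, 0.004)))  # 875000
```

At those assumed numbers the crossover sits at 875,000 requests a month, inside the 500,000 to 2 million band the sources quote; the band is wide precisely because both inputs vary by deployment.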

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
