A customer-services director at a 40-staff UK marketing agency opened her OpenAI invoice last quarter and stared at a number that had tripled. “I thought we hired a consultant to fine-tune our model,” she said to her team. “Why hasn’t that fixed the cost?”
Nobody had told her that the bill almost entirely reflected inference, the cost of running queries, not training. She had paid £8,000 for a fine-tune to solve a recurring inference problem that fine-tuning was never going to fix. The agency was processing roughly 10,000 daily support queries through GPT-5.4 Mini, the bill sat near £518 a month and was still growing, and the consultant had quoted training as the answer to a question the invoice never asked.
The directors I sit with in 2026 have plenty of AI vocabulary. What they need is the one distinction that decides whether the budget holds. Tell training and inference apart on the invoice and the bill stops being a mystery.
What is the difference between training and inference?
Training is the upfront work of teaching a model on examples, almost always absorbed by vendors before you ever see a price. Inference is the recurring cost of running that trained model on every query. Training is capital-style spend, paid once. Inference is operational spend, paid on every customer reply, every drafted proposal, every document the system reads. For UK SMEs on vendor APIs, the split is roughly 90% inference, 10% everything else.
The numbers behind training are why it never lands on your invoice. OpenAI’s GPT-4 training run is reported to have cost between £78 million and £100 million. Gemini Ultra is estimated at around £191 million. Vendors absorbed those costs before any SME sent a query, and they recover them slowly through per-token rates spread across millions of customers. You pay only for the replies the model generates once it is trained.
Every vendor pricing schedule in 2026 is an inference schedule. OpenAI ranges from £0.20 per million input tokens on GPT-5.4 Nano to £30.00 on GPT-5.4 Pro, with output rates four to six times higher. Anthropic’s Claude Haiku sits at £1.00 input and £5.00 output per million tokens, Sonnet 4.6 at £3.00 and £15.00, Opus 4.6 at £5.00 and £25.00. Google Gemini runs £0.10 to £4.00 per million input tokens. AWS Bedrock follows the same per-token shape. No SME-facing training column appears anywhere.
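To see how a per-token schedule turns into a monthly bill, here is a minimal sketch. The rates, query volume, and token counts are assumptions for illustration, not quoted prices; substitute the figures from your own invoice.

```python
# A minimal sketch of how a per-token schedule becomes a monthly bill.
# Rates and token counts are assumptions for the example, not quotes.

RATE_INPUT = 0.40    # GBP per million input tokens (assumed budget tier)
RATE_OUTPUT = 1.60   # GBP per million output tokens (assumed ~4x input)

queries_per_day = 10_000          # the agency's volume from the opening example
input_tokens_per_query = 400      # prompt plus context (assumed)
output_tokens_per_query = 150     # reply length (assumed)

daily_input_m = queries_per_day * input_tokens_per_query / 1e6    # 4.0M tokens
daily_output_m = queries_per_day * output_tokens_per_query / 1e6  # 1.5M tokens

daily_cost = daily_input_m * RATE_INPUT + daily_output_m * RATE_OUTPUT
print(f"~£{daily_cost:.2f} a day, ~£{daily_cost * 30:.0f} a month")
# ~£4.00 a day, ~£120 a month: every line of it inference
```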
When is training the right answer for an SME?
Almost never as training a model from scratch, sometimes as fine-tuning when the task is narrow, stable, and high-volume. Frontier model training is excluded by cost, with a floor in the tens of millions of pounds. Fine-tuning is the one tier of training-style work an SME can plausibly fund, at £5,000 to £30,000 in data labelling plus £2,000 to £10,000 in compute. It pays back only under a narrow set of conditions.
Those conditions are specific. The task has to be narrow, the same shape of input producing the same shape of output. Stable, with criteria that don’t change every few months. High-volume, typically 50,000 or more queries a month on the same task. Classification, extraction, and well-defined drafting jobs at scale are the canonical examples. A LoRA fine-tune on a 7B open-weight model can land at £500 to £5,000 and match a larger paid model on that one narrow task.
For typical UK service businesses the conditions rarely line up. Service work is diverse: one day the AI is routing customer questions, the next drafting proposals, the next reviewing contracts. A UK consulting firm fine-tuned a model on previous proposals for £3,000. The result was slower, sometimes produced firm-specific jargon that was technically wrong, and needed monthly retraining. Six months later the cumulative loss against baseline was around £2,500, on work that prompt engineering would have handled for £500.
When is inference the right answer? (Almost always.)
Inference is the right answer for nearly every UK SME AI deployment in 2026, because the levers that actually move the bill are all inference levers. Three of them dominate: prompt caching, at up to 90% off cached reads on Anthropic and 50% on OpenAI; batch processing, at 50% off real-time pricing on OpenAI, Google, and Anthropic for workloads that tolerate 12-to-24-hour latency; and model routing across budget, mid, and premium tiers.
The arithmetic on caching alone is striking. A customer support system with a 50,000-token product manual and 1,000 daily queries costs around £20 a day on standard input pricing. Turn caching on at a 99% hit rate with 90%-off cached reads and the same workload runs at roughly £2.20 a day, an 89% drop. Batch processing applies to anything that does not need a real-time answer: document pipelines, overnight classification, periodic reports. It halves the cost with no rework.
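Here is a minimal sketch of that caching arithmetic, assuming the £0.40-per-million standard input rate implied by the £20-a-day baseline and an Anthropic-style 90%-off cached read.

```python
# A minimal sketch of the caching arithmetic: a 50,000-token manual
# sent with every query, 1,000 queries a day. The £0.40/M standard
# rate is implied by the £20-a-day baseline in the text; the 90%-off
# cached-read rate is the Anthropic-style discount quoted above.

STANDARD_RATE = 0.40                      # GBP per million input tokens
CACHED_READ_RATE = STANDARD_RATE * 0.10   # 90% off cached reads

manual_tokens = 50_000
queries_per_day = 1_000
hit_rate = 0.99                           # share of reads served from cache

daily_tokens_m = manual_tokens * queries_per_day / 1e6   # 50M tokens a day
baseline = daily_tokens_m * STANDARD_RATE                # ~£20 a day

cached = daily_tokens_m * (
    hit_rate * CACHED_READ_RATE + (1 - hit_rate) * STANDARD_RATE
)
print(f"baseline ~£{baseline:.2f} a day, cached ~£{cached:.2f} a day "
      f"({1 - cached / baseline:.0%} drop)")
# baseline ~£20.00 a day, cached ~£2.18 a day (89% drop)
```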
Model routing is the lever many SMEs underuse. Token pricing varies by roughly 150 times across the spectrum. Sending every query to GPT-5.4 Pro pays premium rates for tasks a budget model solves at 80% quality for 5 to 10% of the cost. The working pattern, sketched below, is to classify each query by complexity, route 70 to 80% to the budget tier, 15 to 25% to mid, and reserve 5 to 10% for premium. One startup documented a fall from £12,000 a month to £1,440 after stacking all three levers.
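A minimal sketch of that routing pattern, under assumed numbers: the 0-to-1 complexity score, the 75/20/5 split, and the 500-million-token monthly volume are placeholders for illustration, and the tier rates are the illustrative figures from the pricing section.

```python
# A minimal sketch of model routing: classify by complexity, then
# price a 75/20/5 split across tiers. All volumes and the complexity
# thresholds are assumptions for the example.

TIER_RATES = {"budget": 0.20, "mid": 3.00, "premium": 30.00}  # GBP per M input tokens

def route(complexity: float) -> str:
    """Map an assumed 0-1 complexity score to a pricing tier."""
    if complexity < 0.75:
        return "budget"
    if complexity < 0.95:
        return "mid"
    return "premium"

print(route(0.30), route(0.85), route(0.99))  # budget mid premium

monthly_tokens_m = 500.0                                 # assumed input volume, millions
mix = {"budget": 0.75, "mid": 0.20, "premium": 0.05}     # assumed routed split

all_premium = monthly_tokens_m * TIER_RATES["premium"]
routed = sum(monthly_tokens_m * share * TIER_RATES[tier]
             for tier, share in mix.items())
print(f"all-premium ~£{all_premium:,.0f} a month, routed ~£{routed:,.0f} a month")
# all-premium ~£15,000 a month, routed ~£1,125 a month: the same order
# of saving as the £12,000-to-£1,440 example above
```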
What does it cost to get this wrong?
The cost of confusing the two is often five to ten times the original AI spend. An SME approves a £5,000 to £8,000 fine-tune as a one-off, expects the cost to plateau, and watches the inference bill grow month after month instead. A West Midlands manufacturer approved an £8,000 fine-tune for computer vision, then found the system was processing 50,000 images a day at £4,000 a month.
Recovery is rarely free. By the time the bill arrives, the architecture is committed. The Midlands firm did claw it back. Batch processing on overnight runs halved the rate. Prompt caching on historical lookups cut another 90% off that slice. Routing simpler product variants to a cheaper model brought spend from £4,000 to £1,200 a month inside three months. At £2,800 a month in savings, the £8,000 fine-tune paid for itself in under three months, once the inference levers were doing the work.
The deeper failure is vocabulary. The owner is buying inference and being told they are buying training. A consultant who quotes fine-tuning as a cost-control answer to a recurring inference problem is either confused about the cost shape or commercially incentivised not to explain it. Once the owner names which side of the split they are on, questions get sharper and deflections get harder.
What to ask any AI vendor before you sign
Five questions reveal a vendor’s cost structure inside an hour, and they are worth asking before the contract is signed. Start with the billing model. Per-token scales with query volume. Per-call bills a fixed amount per conversation. Per-seat charges by user, which encourages vendors to push hiring rather than automation. Per-resolution charges only when the AI solves the problem end-to-end without escalation.
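To see how differently the four billing models scale, here is a minimal sketch with every rate an assumption for illustration; the point is the shape of each curve, not the prices.

```python
# A minimal sketch of the four billing models. Every rate below is an
# assumption for the example, not a quoted price.

conversations_per_month = 5_000
seats = 12
resolution_rate = 0.60               # share solved end-to-end, no escalation
tokens_per_conversation = 2_000      # input + output combined (assumed)

per_token = (conversations_per_month * tokens_per_conversation / 1e6) * 1.00   # £1/M assumed
per_call = conversations_per_month * 0.02                                      # £0.02/call assumed
per_seat = seats * 40                                                          # £40/seat assumed
per_resolution = conversations_per_month * resolution_rate * 0.10              # £0.10 each assumed

print(f"per-token £{per_token:.0f}, per-call £{per_call:.0f}, "
      f"per-seat £{per_seat:.0f}, per-resolution £{per_resolution:.0f}")
# Double the volume and the per-token and per-call lines double;
# per-seat does not move until you hire.
```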
The next three cover the inference levers. Does the platform support prompt caching, and at what discount? Anthropic’s 90% on cached reads versus OpenAI’s 50% is a real difference at volume. Does it support batch processing, and at what discount? The standard is 50%. If your volume grows tenfold in year one, does the cost model change? A vendor offering attractive on-demand pricing today may push you onto provisioned throughput once you scale. Get the answer in writing.
The fifth question separates the serious vendors from the rest. Can the platform separately bill or report training-related and inference-related costs, so the owner can see which component drives the spend? Vendors who can do that cleanly are signalling commercial maturity. The decision sequence on your side is simple. Start with an off-the-shelf model and a clear prompt. If cost is unacceptable, layer caching and batching. If still unacceptable, layer routing. Only then should fine-tuning or self-hosting return to the table.
If you cannot tell which line on your AI invoice is training and which is inference, book a conversation and we can read the bill together.