A 30-staff professional services firm I work with trialled an AI email-drafting feature for the senior team. Month one came in at £18, well within budget. Three months in, the finance director flagged the bill at £170. Same six users, same daily volume, no obvious change in usage. The investigation took an afternoon. The system prompt had grown from 200 tokens to 800 because three different team leads had each added their own instructions. A new validation check was firing the retry logic on roughly a third of calls. And the team had quietly enabled a “reasoning” model that charges for invisible chain-of-thought at output rates.
None of those changes were strategic decisions. Each one was a small tweak that compounded. The director’s question was the right one: “What is a token, and why is it costing us nine times what it did in March?”
What is a token?
A token is the smallest billable unit of text an AI language model processes. It is not a word and not a character. In English a token is roughly four characters, or about three-quarters of an average word, so a 1000-word business document tokenises to around 1300 tokens. Common words become single tokens. Rarer words split into two or three. Punctuation, spaces and numbers each count separately.
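As a rough check on that rule of thumb, OpenAI publishes its tokeniser as the open-source tiktoken library. A minimal sketch, noting that the cl100k_base encoding shown here is OpenAI's, so the same text will count slightly differently on Claude or Gemini:

```python
# pip install tiktoken (OpenAI's open-source tokeniser)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding

text = ("Common words become single tokens. Rarer words split into two "
        "or three. Punctuation, spaces and numbers each count separately.")

tokens = enc.encode(text)
print(f"characters: {len(text)}, tokens: {len(tokens)}")
print(f"chars per token: {len(text) / len(tokens):.1f}")  # around 4 for English
```

Anthropic and Google do not ship a local library; both expose equivalent counts through count-tokens endpoints in their APIs.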
The model breaks your input into tokens, runs the tokens through the maths it learnt during training, and generates output tokens one at a time. Vendors charge per token in both directions. The two main tokenising approaches are Byte-Pair Encoding, which OpenAI uses, and SentencePiece, which Google uses; Anthropic does not publish its tokeniser. The vocabularies differ enough that the same document tokenises slightly differently across vendors, which means a price-per-million-tokens comparison between Claude and Gemini is never quite apples to apples.
You do not need to understand the algorithms. You need to know that a token is the unit, four characters is the rule of thumb, and the same text costs different amounts on different vendors before any pricing decision has been made.
Why does it matter for your business?
It matters because tokens are the lens through which every architectural decision in your AI setup shows up on the bill. The headline price on a vendor pricing page is one input. The other inputs are choices your team makes, often without realising they are pricing decisions.
Three asymmetries do most of the work. Output tokens cost four to six times more per token than input tokens across every major vendor, because generating text sequentially is computationally more expensive than reading it in parallel. Cached tokens, ones reused across requests, cost ten to twenty percent of the standard input rate, which makes architecture matter as much as model choice. And reasoning models, the newer generation that “think” before answering, charge for the invisible thinking tokens at full output rates, so a 200-word answer can carry 1000 tokens of billable thinking behind it.
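To see what the reasoning premium does to a single call, here is a minimal sketch. The rates are the Claude Sonnet figures quoted in the next section, and every token count is an assumption, not a measurement:

```python
# Illustrative cost of one reasoning-model call; all counts are assumed.
RATE_IN = 3.00 / 1_000_000    # input rate, $ per token
RATE_OUT = 15.00 / 1_000_000  # output rate, $ per token (5x the input rate)

input_tokens = 1_000      # prompt plus retrieved context (assumed)
visible_output = 260      # a 200-word answer, roughly 1.3 tokens per word
thinking_tokens = 1_000   # invisible chain-of-thought, billed at the output rate

visible_cost = input_tokens * RATE_IN + visible_output * RATE_OUT
full_cost = visible_cost + thinking_tokens * RATE_OUT

print(f"cost you might expect: ${visible_cost:.4f}")
print(f"cost you actually pay: ${full_cost:.4f}")  # thinking tokens dominate
```

On these assumptions the invisible thinking is the single largest line in the call, larger than the prompt and the visible answer combined.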
For a UK services business at £1m to £10m turnover, token cost is rarely the binding constraint. At sensible volumes the bill sits well under one percent of the value the AI is producing. The owners who get burned are the ones who do not realise hidden token consumption can quietly multiply a £20 monthly bill into a £200 monthly bill without anyone changing the use case.
Where will you actually meet it?
You meet tokens in three places. The vendor pricing page is the first. Anthropic, OpenAI and Google all quote per-million-token rates with input and output broken out separately, with cached and batch pricing alongside. Claude Sonnet 4.6 sits around $3 per million input and $15 per million output in 2026, Haiku around $1 and $5, Gemini Flash near the bottom at $0.30 and $2.50.
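Those rates turn into a monthly number quickly. A sketch comparing the three models on one assumed workload (200 requests per working day, 2,000 input and 400 output tokens each; adjust to your own traffic):

```python
# $ per million tokens, as quoted above (2026 list prices; check current pages)
MODELS = {
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku":      (1.00, 5.00),
    "Gemini Flash":      (0.30, 2.50),
}

# Assumed workload, not a benchmark.
REQUESTS_PER_MONTH = 200 * 22          # 200 requests per working day
IN_TOKENS, OUT_TOKENS = 2_000, 400     # per request

for name, (rate_in, rate_out) in MODELS.items():
    monthly = REQUESTS_PER_MONTH * (
        IN_TOKENS * rate_in + OUT_TOKENS * rate_out
    ) / 1_000_000
    print(f"{name:18s} ${monthly:7.2f}/month")
```

All three land comfortably inside the under-£100 bucket discussed below, which is the point: at sensible volumes, model choice moves the bill by tens of pounds, not hundreds.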
The second place is the usage dashboard inside your vendor’s developer console. OpenAI, Anthropic and Google all show daily and monthly token counts, broken out by model and by API key. If you have not separated your AI workflows behind different API keys, the dashboard tells you how much you spent but not which use case spent it. That gap is where finance directors typically first notice the problem.
The third place is the surprise invoice. The pattern is consistent: a pilot under budget, three months of quiet expansion, and a month-four bill that needs an explanation. By that point the contract is signed, the workflows are embedded, and the conversation moves from “should we use this” to “what changed”. The honest answer is almost always one of four things, covered next.
When to ask vs when to ignore
The decision rule is volumetric. If your monthly token bill is under £100, ignore optimisation entirely and focus on whether the use case works at all. Between £100 and £500 a month, audit your system prompts and retrieval logic before considering a vendor switch. Above £500 a month, model selection, prompt caching and batch processing become real levers worth real attention. Most UK SMEs running AI in 2026 sit in that first bucket for years.
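Encoded as a crude lookup, for the wall rather than for production:

```python
def what_to_do(monthly_token_bill_gbp: float) -> str:
    """Volumetric decision rule from above; thresholds in GBP per month."""
    if monthly_token_bill_gbp < 100:
        return "Ignore optimisation; test whether the use case works at all."
    if monthly_token_bill_gbp <= 500:
        return "Audit system prompts and retrieval logic before switching vendor."
    return "Model selection, prompt caching and batch processing are real levers."

print(what_to_do(170))  # the bill from the opening story
```

The £170 bill from the opening story sits squarely in the middle band, and the audit it prescribes is exactly what found the answer: prompts and retries, not the vendor.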
Within that frame, four hidden token sources are worth scanning for once a bill jumps. System prompts that grow over time as multiple stakeholders add instructions, doubling input cost on every call. Retry logic that re-runs failed requests, multiplying consumption silently. Retrieved context that expands from three documents per query to ten “to be safe”, tripling input. And reasoning-model thinking tokens, billed as output and often invisible on the dashboard. Each one is a small tweak. They compound.
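Those multipliers are enough to rebuild most of the opening story. A sketch with assumed token counts, holding per-token rates constant between the before and after cases:

```python
# Assumed per-call token counts; illustrative, not measured.
R_IN, R_OUT = 3e-6, 15e-6  # $/token, output at 5x input

def monthly_cost(prompt, user, visible, thinking, calls, retry_rate):
    per_call = (prompt + user) * R_IN + (visible + thinking) * R_OUT
    return per_call * calls * (1 + retry_rate)  # a retry re-runs the full call

CALLS = 6 * 20 * 10  # six users, ~20 working days, ~10 drafts a day (assumed)

before = monthly_cost(prompt=200, user=400, visible=250, thinking=0,
                      calls=CALLS, retry_rate=0.0)
after = monthly_cost(prompt=800, user=400, visible=250, thinking=1000,
                     calls=CALLS, retry_rate=0.33)

print(f"before: ${before:.2f}  after: ${after:.2f}  ratio: {after/before:.1f}x")
```

On these assumptions the bill is already five times larger before the new model's own per-token rates, which are typically higher for reasoning models, are counted. That is how £18 becomes £170 without anyone changing the use case.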
The exception to the volumetric rule is high-volume customer-facing automation. Customer support chat, document processing and content generation at scale all push token costs to track business volume rather than headcount, and the maths inverts quickly. If that is your pattern, the right question is “have we structured our prompts and retrieval to use what we are paying for”, and that is worth paying for an answer to.
Related concepts
Input and output tokens are the two sides of the bill, with output costing four to six times more per token across every major vendor. Knowing the input-output ratio for your specific workload is the first step to a credible cost forecast, and it is the number a pricing page typically omits when it advertises a single headline rate.
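The blended rate is one line of arithmetic once you know the ratio. A sketch using the $3 input and $15 output figures quoted earlier; the workload ratios are illustrative:

```python
def blended_rate(rate_in, rate_out, in_out_ratio):
    """Effective $ per million tokens for a workload with a given input:output ratio."""
    return (in_out_ratio * rate_in + rate_out) / (in_out_ratio + 1)

# $3 in / $15 out, as in the pricing example above; ratios are assumptions.
print(blended_rate(3.00, 15.00, 5))  # input-heavy summarisation: 5.0
print(blended_rate(3.00, 15.00, 1))  # balanced chat: 9.0
```

An input-heavy workload blends to $5 per million tokens; a balanced chat workload blends to $9, three times the headline input rate.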
The context window is the maximum number of tokens a model can process in a single request. In 2026, flagship models from Anthropic, OpenAI and Google all support around a million tokens of context, enough for a 300-page book in one pass. Larger contexts are not always better: irrelevant context degrades answer quality and costs more.
Prompt caching is the architectural lever that pays back fastest at volume. Reusable context (system prompts, knowledge bases, few-shot examples) can be cached at 10 to 20 percent of the standard rate after the first request. For workloads that reuse the same context across thousands of requests, caching cuts input cost by 80 to 90 percent.
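The saving is simple arithmetic once you know what share of each request is reusable. A sketch assuming cached reads at 10 percent of standard and 90 percent of each request being shared context; both numbers vary by vendor and workload:

```python
RATE_IN = 3.00       # $/million input tokens, standard
CACHE_MULT = 0.10    # cached reads at 10% of standard (vendor-dependent, 10-20%)

cached_share = 0.90  # assumed share of each request that is reusable context

effective = cached_share * RATE_IN * CACHE_MULT + (1 - cached_share) * RATE_IN
saving = 1 - effective / RATE_IN
print(f"effective input rate: ${effective:.2f}/M tokens ({saving:.0%} saved)")
```

One caveat: on some vendors the first request pays a small premium to write the cache, so the saving only materialises when the same context is reused many times.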
Inference cost is the broader term for what you pay every time the model runs, as opposed to training cost which the provider absorbed once. When a vendor talks about scaling their AI, they almost always mean scaling inference, and inference is what you fund.
The point of all of this is to give you enough mental model that the next vendor pricing page stops being a single number and starts being a question. Which model. What input-output mix. What cached share. What reasoning premium. The vendors who answer those questions clearly are the ones worth keeping on your shortlist. If you want to talk through where your own usage sits and what to ask next, book a conversation.



