A 50-staff legal practice receives a 40-page contract from a corporate client. Manual review by a junior lawyer takes two to three hours at £150 an hour, around £300 to £450 per contract. The firm trials AI-assisted screening and immediately runs into an architectural choice. Stuff the whole 40 pages, about 50,000 tokens, into Claude Opus 4.7’s million-token window and ask it to flag deviations from the firm’s playbook. Cost per contract: roughly 30 to 50 pence. Or feed the contract into a retrieval system that returns only the five to ten most relevant chunks, around 5,000 tokens, on each query. Cost per contract: around 5 to 10 pence.
For one contract a week, the difference is a rounding error. For a hundred contracts a week, it is the difference between £30 to £50 in AI spend and £5 to £10. The technical name for the budget being spent is the context window. Both approaches fit the document, so the architectural question, retrieval or full-context stuffing, becomes “which delivers the answer the firm needs at the cost the firm can defend?”
What is a context window?
A context window is the total number of tokens an AI model can see in a single request. It is the model’s working memory, and the budget is shared. Your prompt, any retrieved documents, the conversation history, and the response the model generates all compete for the same space. A typical email is about 270 tokens; a 10-page report is around 4,000; a 50-page contract is roughly 25,000.
A million-token window with a 64K output reserve gives you about 936K tokens of input space. Tokens are not words. One token is roughly 0.75 English words on average, but the ratio collapses for code, JSON, and structured data, where counts inflate by 35 to 50%. The mental model that helps owners is a fixed budget per request: spend it on one giant analysis or split it across many smaller queries. Exceeding it has consequences.
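That arithmetic is worth doing before any request is sent. Here is a minimal sketch of the budget maths, using the averages quoted above (0.75 words per token for prose, a 35 to 50% inflation for structured data); real counts come from the provider’s own tokenizer, so treat these as estimates rather than guarantees.

```python
# Rough token-budget arithmetic for a single request.
# Ratios are the averages quoted above, not guarantees.

WORDS_PER_TOKEN = 0.75        # typical for English prose
STRUCTURED_INFLATION = 1.4    # code/JSON often runs 35-50% heavier

def estimate_tokens(word_count: int, structured: bool = False) -> int:
    """Estimate token count from a word count."""
    tokens = word_count / WORDS_PER_TOKEN
    if structured:
        tokens *= STRUCTURED_INFLATION
    return int(tokens)

def usable_input_budget(window: int, output_reserve: int) -> int:
    """Input space left once room for the model's reply is reserved."""
    return window - output_reserve

budget = usable_input_budget(window=1_000_000, output_reserve=64_000)
report = estimate_tokens(word_count=3_000)   # roughly a 10-page report
print(budget, report)                        # 936000, 4000
```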
Why it matters for your business
It matters because the cost relationship is non-linear and the accuracy relationship is non-obvious. Doubling input length roughly quadruples the underlying attention compute, and the bill scales with every token you send: a million-token request to Claude Opus 4.7 costs around $5 in input alone, while the same answer derived from a 10,000-token retrieval call costs around $0.05. That is a hundredfold gap on tasks where retrieval would have produced the same business outcome.
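In plain numbers, a minimal sketch of that gap, taking the indicative $5-per-million input rate from the example above (not a quoted price list, and output tokens are ignored):

```python
# Input-token cost at a flat per-million rate (illustrative figure only).
RATE_PER_MILLION = 5.00   # USD per million input tokens, as in the example above

def input_cost(tokens: int, rate_per_million: float = RATE_PER_MILLION) -> float:
    return tokens / 1_000_000 * rate_per_million

full_context = input_cost(1_000_000)   # $5.00 per request
retrieval    = input_cost(10_000)      # $0.05 per request
print(full_context / retrieval)        # 100x gap for the same business outcome
```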
The accuracy side is the part owners miss. Independent benchmarks (RULER, NoLiMa, LongBench v2) consistently find that models advertised at 1M tokens hold reliable accuracy only up to roughly 600K-700K. The “lost in the middle” phenomenon, documented by Liu et al. in 2023, shows that information buried in the middle of a long context is recalled worst, while material at the start and end fares better. For owners, the practical rule is to assume 60-70% of advertised context is the dependable working size and budget accordingly.
Google adds a structural twist. Gemini 3.1 Pro charges $2 per million input tokens up to 200K, then $4 per million above that. The marginal rate doubles at the tier boundary, which makes naive full-context approaches structurally expensive on Google. Anthropic removed long-context surcharges in March 2026, so a 900K request costs the same per-token rate as a 9K one, but the underlying compute is still real and priced into the flat rate.
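The tier boundary is easy to model. A sketch assuming the $2/$4 split at 200K quoted above; check the current rate card before relying on these figures:

```python
# Tiered input pricing: one rate up to the boundary, a higher rate above it.
BOUNDARY = 200_000
LOW_RATE = 2.00    # USD per million input tokens, up to 200K
HIGH_RATE = 4.00   # USD per million input tokens, above 200K

def tiered_input_cost(tokens: int) -> float:
    low = min(tokens, BOUNDARY)
    high = max(tokens - BOUNDARY, 0)
    return low / 1e6 * LOW_RATE + high / 1e6 * HIGH_RATE

print(tiered_input_cost(150_000))   # ~$0.30, entirely in the lower tier
print(tiered_input_cost(900_000))   # $0.40 + $2.80 = $3.20, mostly at the doubled rate
```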
Where you will meet it
You will meet context windows on every vendor pricing page in 2026. Anthropic publishes the 200K and 1M configurations for the Claude family. OpenAI lists GPT-5.5 at 1.1M tokens of input capacity. Google lists the Gemini tier boundary at 200K. Meta’s Llama 4 Scout advertises 10M tokens, which is the marketing number. Independent testing puts its effective context at roughly 1M before accuracy collapses, an 80-90% gap between claimed and dependable.
You will meet it in the way agentic tools spend money. An AI coding agent on a 50-step task re-sends the entire conversation history at every turn. Turn one might consume 5,000 input tokens. By turn 30 the cumulative input has crossed a million tokens for a single task, even though each turn only adds 3,000 to 5,000 new ones. This is what people mean when they say agentic workflows are quadratic. Prompt caching is the lever that bends it back towards linear, holding the system prompt and tool definitions at a 90% discount on cached reads.
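The growth is easy to see in a back-of-envelope model. This sketch uses the illustrative figures above: a 5,000-token starting context, roughly 4,000 new tokens per turn, and a 90% discount on cached reads (cache-write premiums are ignored for simplicity, so treat it as directional rather than a quote).

```python
# Cumulative input tokens for an agent that re-sends its whole history
# every turn, with and without prompt caching. All figures illustrative.

START = 5_000                 # system prompt + tool definitions + first request
PER_TURN = 4_000              # new tokens added each turn (mid-range of 3-5K)
CACHED_READ_DISCOUNT = 0.9    # cached tokens billed at ~10% of the input rate

def cumulative_input(turns: int, cached: bool = False) -> float:
    total = 0.0
    context = START
    sent_before = 0           # tokens already sent on a previous turn (cacheable)
    for _ in range(turns):
        new = context - sent_before
        if cached:
            billed = sent_before * (1 - CACHED_READ_DISCOUNT) + new
        else:
            billed = context
        total += billed
        sent_before = context
        context += PER_TURN   # the history grows every turn
    return total

print(cumulative_input(30))               # ~1.89M billed input tokens
print(cumulative_input(30, cached=True))  # ~0.3M, back towards linear growth
```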
You will meet it most painfully in silent overflow. AWS has documented this explicitly: when input exceeds the model’s hard limit, some vendors do not return an error. They truncate, the application keeps running, and answers degrade quietly. Test with worst-case input volumes before going live, and put an explicit check in your code that fails loudly rather than truncating.
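A guard of that kind does not need to be sophisticated. Here is a minimal sketch; the character-based estimate is a crude heuristic, and in production the count should come from your provider’s own tokenizer or token-counting endpoint.

```python
# Fail loudly if a request would exceed the model's usable input budget,
# rather than letting the provider truncate silently.

class ContextOverflowError(Exception):
    pass

def rough_token_count(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English prose.
    # Replace with the provider's tokenizer for real numbers.
    return len(text) // 4

def check_fits(prompt: str, documents: list[str],
               window: int = 1_000_000, output_reserve: int = 64_000) -> int:
    total = rough_token_count(prompt) + sum(rough_token_count(d) for d in documents)
    budget = window - output_reserve
    if total > budget:
        raise ContextOverflowError(
            f"Request needs ~{total:,} input tokens but only {budget:,} are available."
        )
    return total
```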
When to ask vs when to ignore
Ask about context windows when the work genuinely needs cross-document reasoning. Single-pass contract review where missing a middle clause has consequences. Codebase refactoring where ripple effects must be visible across files. Multi-document strategy synthesis where the value is in connecting dots across sources. In each of these, the labour cost being displaced is much higher than the AI cost, and a 1M-token window earns its fee.
Ignore the context-window number when the work is retrieval-shaped. Chatbots and Q&A on stable documents. Structured extraction from forms and invoices. Routine data-entry automation. Real-time customer interactions where latency matters. High-volume cost-sensitive work where per-token economics dominate. In each of these, retrieval delivers around 95% of the accuracy at 5 to 10% of the cost. A 1M-token model in this slot is paying for capacity you will never use.
The vendor question worth asking is “what is your effective context on RULER or LongBench v2, and how do you handle overflow?” That is the question that separates a working product from a pricing sticker. If the answer is hand-wavy, treat the headline number as marketing.
Related concepts
Tokens are the atomic unit. The mapping from words to tokens is not a fixed ratio, and code, JSON, and structured data inflate the count by 35 to 50% relative to English prose. The full picture lives in the what is a token post.
Input vs output tokens are billed differently. Output is typically 5x more expensive per token than input on Claude Sonnet, so concise structured outputs (JSON, CSV) cost less to generate than narrative summaries. The trade-off is covered in what are input vs output tokens.
Prompt caching is the lever that turns repeated context into a 90% discount on subsequent reads. For workflows reusing a large system prompt, knowledge-base prefix, or tool definitions, it is the single biggest cost lever. See what is prompt caching for the mechanics.
Retrieval-augmented generation is the alternative architecture for retrieval-shaped workloads. Rather than stuffing the whole document, it fetches the most relevant chunks and feeds those to the model. For SMEs under £10m turnover, RAG is the default; full-context stuffing is the exception reserved for genuine synthesis tasks.
The honest framing for any owner sitting opposite a vendor pitch is this. The headline context window is half the story. The other half is effective context, the cost curve, and whether your use case actually benefits from processing more data at once. For typical SME workloads, it does not.