What is a context window? Why it matters for your business

TL;DR

A context window is the total number of tokens an AI model can see in one request: your prompt, retrieved documents, conversation history, and the model's response. Frontier models in 2026 advertise one million tokens, but independent benchmarks show effective context sits closer to 60-70% of that. For typical SME workloads, retrieving the right 5,000 to 50,000 tokens beats stuffing the full document at a fraction of the cost.

Key takeaways

- Context window is the model's working memory measured in tokens. Your prompt, your documents, the conversation history, and the model's response all share the same budget.
- Frontier models advertise 1M tokens. Independent benchmarks (RULER, NoLiMa, LongBench v2) show effective context is typically 60-70% of advertised before accuracy degrades.
- Doubling input length roughly quadruples compute cost under standard attention. A million-token request can cost 100x a 10,000-token retrieval call.
- For most SME workloads, retrieve only the relevant 5,000 to 50,000 tokens. Reserve the full window for genuine cross-document synthesis where the labour cost being displaced is large.
- Test for silent overflow before going live. Some vendors truncate without an error when input exceeds the limit, so the application keeps running while answers degrade.

A 50-staff legal practice gets a 40-page contract from a corporate client. Manual review by a junior lawyer takes two to three hours at £150 an hour, around £300 to £450 per contract. The firm trials AI-assisted screening and runs into the architectural choice. Stuff the whole 40 pages, about 50,000 tokens, into Claude Opus 4.7’s million-token window and ask it to flag deviations from the firm’s playbook. Cost per contract: roughly 30 to 50 pence. Or feed the contract into a retrieval system that returns only the five to ten most relevant chunks, around 5,000 tokens, on each query. Cost per contract: around 5 to 10 pence.

For one contract a week, the difference is a rounding error. For a hundred contracts a week, it is the difference between £30 to £50 in AI spend and £5 to £10. The technical name for the budget being spent is the context window. Both approaches fit the document, so the architectural question, retrieval or full-context stuffing, becomes "which delivers the answer the firm needs at the cost the firm can defend?"
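The arithmetic behind that comparison fits in a few lines. This is an illustrative sketch, not a quote: the $5-per-million-token input rate matches the Opus input figure used later in this post, and the per-contract token counts are the assumptions stated above.

```python
# Illustrative cost comparison: full-context stuffing vs retrieval.
# The input rate and token counts are assumptions from the article,
# not live vendor pricing.

INPUT_PRICE_PER_M = 5.00  # USD per million input tokens (assumed)

def input_cost(tokens: int, price_per_m: float = INPUT_PRICE_PER_M) -> float:
    """Input-side cost in USD for a single request."""
    return tokens / 1_000_000 * price_per_m

full_stuff = input_cost(50_000)  # whole 40-page contract per query
retrieval = input_cost(5_000)    # top 5-10 relevant chunks per query

weekly_volume = 100              # contracts per week
print(f"Full context: ${full_stuff:.3f}/contract, ${full_stuff * weekly_volume:.2f}/week")
print(f"Retrieval:    ${retrieval:.3f}/contract, ${retrieval * weekly_volume:.2f}/week")
```

The tenfold per-contract gap is what compounds at volume; output tokens and exchange rates would shift the absolute numbers but not the ratio.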

What is a context window?

A context window is the total number of tokens an AI model can see in a single request. It is the model’s working memory, and the budget is shared. Your prompt, any retrieved documents, the conversation history, and the response the model generates all compete for the same space. A typical email is about 270 tokens; a 10-page report is around 4,000; a 50-page contract is roughly 25,000.

A million-token window with a 64K output reserve gives you about 936K tokens of input space. Tokens are not words. One token is roughly 0.75 English words on average, but the ratio collapses for code, JSON, and structured data, where counts inflate by 35 to 50%. The mental model that helps owners is a fixed budget per request: spend it on one giant analysis or split it across many smaller queries. Exceeding it has consequences.
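The budget arithmetic above can be sketched directly. The 0.75 words-per-token ratio, the 35 to 50% inflation for structured data, and the 64K output reserve all come from this post; the function names are illustrative, and a real system should count with the vendor's tokenizer rather than a heuristic.

```python
# Rough token budgeting using the rules of thumb from the article.
# A heuristic, not a tokenizer: use the vendor's tokenizer in production.

def estimate_tokens(word_count: int, structured: bool = False) -> int:
    """Estimate tokens from a word count (~0.75 English words per token)."""
    tokens = word_count / 0.75
    if structured:       # code, JSON, and tables inflate counts 35-50%
        tokens *= 1.4    # midpoint of that range, assumed
    return round(tokens)

def input_budget(window: int = 1_000_000, output_reserve: int = 64_000) -> int:
    """Tokens left for input after reserving space for the response."""
    return window - output_reserve

print(estimate_tokens(200))   # a ~200-word email lands near 270 tokens
print(input_budget())         # ~936K of input space on a 1M window
```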

Why it matters for your business

It matters because the cost relationship is non-linear and the accuracy relationship is non-obvious. Doubling input length roughly quadruples compute cost under standard attention, so a million-token request to Claude Opus 4.7 costs around $5 in input alone, while the same answer derived from a 10,000-token retrieval call costs around $0.05. That is a hundredfold gap on tasks where retrieval would have produced the same business outcome.

The accuracy side is the part owners miss. Independent benchmarks (RULER, NoLiMa, LongBench v2) consistently find that models advertised at 1M tokens hold reliable accuracy only to roughly 600K-700K. The “lost in the middle” phenomenon, documented by Liu et al. in 2023, shows that information sitting in the middle of a long context is attended worst, while material at the start and end fares better. For owners, the practical rule is to assume 60-70% of advertised context is the dependable working size and budget accordingly.

Google adds a structural twist. Gemini 3.1 Pro charges $2 per million input tokens up to 200K, then $4 per million above that. The marginal rate doubles at the tier boundary, which makes naive full-context approaches structurally expensive on Google. Anthropic removed long-context surcharges in March 2026, so a 900K request costs the same per-token rate as a 9K one, but the underlying compute cost is still in the bill.
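The tier boundary is easy to model. This sketch reads the post's "marginal rate doubles" description literally, charging only the tokens above 200K at the higher rate; check the vendor's pricing page for how the tier is actually applied, since some vendors bill the whole request at the higher rate once it crosses the boundary.

```python
# Tiered input pricing as described in the article: $2/M up to the
# 200K boundary, $4/M for tokens above it (marginal-rate reading).

TIER_BOUNDARY = 200_000
LOW_RATE = 2.00   # USD per million tokens, at or below the boundary
HIGH_RATE = 4.00  # USD per million tokens, above the boundary

def tiered_input_cost(tokens: int) -> float:
    low = min(tokens, TIER_BOUNDARY)
    high = max(tokens - TIER_BOUNDARY, 0)
    return low / 1e6 * LOW_RATE + high / 1e6 * HIGH_RATE

print(tiered_input_cost(150_000))    # below the boundary: low rate only
print(tiered_input_cost(1_000_000))  # 200K at $2/M plus 800K at $4/M
```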

Where you will meet it

You will meet context windows on every vendor pricing page in 2026. Anthropic publishes the 200K and 1M configurations for the Claude family. OpenAI lists GPT-5.5 at 1.1M tokens of input capacity. Google lists the Gemini tier boundary at 200K. Meta’s Llama 4 Scout advertises 10M tokens, which is the marketing number. Independent testing puts its effective context at roughly 1M before accuracy collapses, an 80-90% gap between claimed and dependable.

You will meet it in the way agentic tools spend money. An AI coding agent on a 50-step task re-sends the entire conversation history at every turn. Turn one might consume 5,000 input tokens. By turn 30 the cumulative input has crossed a million tokens for a single task, even though each turn only adds 3,000 to 5,000 new ones. This is what people mean when they say agentic workflows are quadratic. Prompt caching is the lever that bends it back towards linear, holding the system prompt and tool definitions at a 90% discount on cached reads.
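The quadratic growth is simple to demonstrate. This sketch uses the turn sizes and the 90% cached-read discount quoted above; treating the full re-sent history as cacheable is an optimistic assumption, since real caching covers only an unchanged prefix.

```python
# Why agentic workflows are "quadratic": each turn re-sends the whole
# history, so cumulative input grows with the square of the turn count.

def cumulative_input(turns: int, first_turn: int = 5_000,
                     per_turn: int = 4_000) -> int:
    """Total input tokens billed across a task, no caching."""
    total = 0
    history = first_turn
    for _ in range(turns):
        total += history     # the whole history is re-sent each turn
        history += per_turn  # each turn appends ~3-5K new tokens
    return total

def cached_equivalent(turns: int, first_turn: int = 5_000,
                      per_turn: int = 4_000, discount: float = 0.90) -> int:
    """Billing-equivalent tokens if the re-sent history hits the cache."""
    total = 0.0
    history = 0
    for turn in range(turns):
        new = first_turn if turn == 0 else per_turn
        total += history * (1 - discount) + new  # cached prefix at 10%
        history += new
    return round(total)

print(cumulative_input(30))    # ~1.89M tokens: past a million by turn 30
print(cached_equivalent(30))   # ~0.3M billing-equivalent with caching
```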

You will meet it most painfully in silent overflow. AWS has documented this explicitly: when input exceeds the model’s hard limit, some vendors do not return an error. They truncate, the application keeps running, and answers degrade quietly. Test with worst-case input volumes before going live, and put an explicit check in your code that fails loudly rather than truncating.
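A loud pre-flight check is a few lines of code. This sketch budgets against 65% of the advertised window, following the 60-70% effective-context rule of thumb above; the exception name, constants, and function are illustrative, not any vendor's API.

```python
# Fail loudly before sending rather than letting a vendor truncate
# silently. Constants follow the article's rules of thumb (assumed).

ADVERTISED_WINDOW = 1_000_000
EFFECTIVE_FACTOR = 0.65   # dependable share of the headline number
OUTPUT_RESERVE = 64_000   # space held back for the response

class ContextOverflowError(RuntimeError):
    """Raised when a request would exceed the effective input budget."""

def check_budget(input_tokens: int) -> int:
    """Raise instead of truncating; return remaining headroom in tokens."""
    budget = int(ADVERTISED_WINDOW * EFFECTIVE_FACTOR) - OUTPUT_RESERVE
    if input_tokens > budget:
        raise ContextOverflowError(
            f"request of {input_tokens:,} tokens exceeds the "
            f"{budget:,}-token effective budget; refusing to truncate"
        )
    return budget - input_tokens
```

Run this check with your worst-case document before go-live, not after the first degraded answer.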

When to ask vs when to ignore

Ask about context windows when the work genuinely needs cross-document reasoning. Single-pass contract review where missing a middle clause has consequences. Codebase refactoring where ripple effects must be visible across files. Multi-document strategy synthesis where the value is in connecting dots across sources. In each of these, the labour cost being displaced is much higher than the AI cost, and a 1M-token window earns its fee.

Ignore the context-window number when the work is retrieval-shaped. Chatbots and Q&A on stable documents. Structured extraction from forms and invoices. Routine data-entry automation. Real-time customer interactions where latency matters. High-volume cost-sensitive work where per-token economics dominate. In each of these, retrieval delivers around 95% of the accuracy at 5 to 10% of the cost. A 1M-token model in this slot is paying for capacity you will never use.

The vendor question worth asking is “what is your effective context on RULER or LongBench v2, and how do you handle overflow?” That is the question that separates a working product from a pricing sticker. If the answer is hand-wavy, treat the headline number as marketing.

Tokens are the atomic unit. The mapping from words to tokens is non-deterministic, and code, JSON, and structured data inflate the count by 35 to 50% relative to English prose. The full picture lives in the what is a token post.

Input vs output tokens are billed differently. Output is typically 5x more expensive per token than input on Claude Sonnet, so concise structured outputs (JSON, CSV) cost less to generate than narrative summaries. The trade-off is covered in what are input vs output tokens.

Prompt caching is the lever that turns repeated context into a 90% discount on subsequent reads. For workflows reusing a large system prompt, knowledge-base prefix, or tool definitions, it is the single biggest cost lever. See what is prompt caching for the mechanics.

Retrieval-augmented generation is the alternative architecture for retrieval-shaped workloads. Rather than stuffing the whole document, it fetches the most relevant chunks and feeds those to the model. For SMEs under £10m turnover, RAG is the default; full-context stuffing is the exception reserved for genuine synthesis tasks.

The honest framing for any owner sitting opposite a vendor pitch is this. The headline context window is half the story. The other half is effective context, the cost curve, and whether your use case actually benefits from processing more data at once. For typical SME workloads, it does not.

Sources

Anthropic (2026). Claude pricing and model configuration, including 200K and 1M context windows. https://platform.claude.com/docs/en/about-claude/pricing
OpenAI (2026). Introducing GPT-5.5, including frontier context configuration. https://openai.com/index/introducing-gpt-5-5/
Google (2026). Gemini API pricing, including the 200K input-token tier boundary. https://ai.google.dev/gemini-api/docs/pricing
Liu et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. The canonical paper on position-dependent attention degradation. https://arxiv.org/abs/2307.03172
Hsieh et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models? Benchmark for long-context retrieval and reasoning. https://arxiv.org/abs/2404.06654
Adobe Research (2025). NoLiMa: Long-Context Evaluation Beyond Literal Matching. Harder long-context benchmark with minimal lexical overlap. https://github.com/adobe-research/NoLiMa
AWS (2024). Context window overflow: breaking the barrier. The truncation failure mode named in production. https://aws.amazon.com/blogs/security/context-window-overflow-breaking-the-barrier/
Atlan (2025). LLM context window limitations: effective context, metadata degradation, and enterprise overhead. https://atlan.com/know/llm-context-window-limitations/
Pinecone (2024). Why retrieval beats stuffing for most production workloads. https://www.pinecone.io/blog/why-use-retrieval-instead-of-larger-context/

Frequently asked questions

How big a context window do I actually need?

For chatbots, Q&A, extraction, and routine automation, a 128K-token window is usually plenty when paired with retrieval. Reserve 1M-token windows for one-shot deep work where the model genuinely has to reason across the whole document: single-pass contract review, codebase refactoring, multi-document strategy synthesis. The size of the window matters less than the architecture you wrap around it.

Why does bigger cost so much more?

Standard attention compute scales roughly quadratically with input length, so doubling the tokens roughly quadruples the compute. On top of that, some vendors add tier pricing above 200K tokens, which means the rate per token literally doubles past the boundary. A million-token Claude Opus request can cost around five dollars in input alone, while a focused 10,000-token retrieval call costs a few pence for the same business answer.

What is "effective context" and why does it matter?

Effective context is the size at which the model still answers reliably, measured by independent benchmarks like RULER, NoLiMa, and LongBench v2. Across frontier models tested in 2025, effective context typically sits 30 to 40% below the advertised maximum, so a 1M-token model is dependable to roughly 600K-700K tokens. Budget on the effective number, not the headline.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30-minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
