RAG versus large context windows: which suits your business?

TL;DR

RAG and large context windows solve different parts of the document-AI problem. RAG retrieves relevant sections from a large or changing knowledge base; large context windows load the whole document at once so the model reasons across it in full. For owner-managed businesses, the deciding factors are the size and stability of your document library, whether your outputs need traceable citations, and what an incorrect answer costs your firm.

Key takeaways

- RAG retrieves relevant document sections before passing them to the AI model; large context windows load the full document at once. The right choice depends on the size and stability of your document library and how fast it changes.
- IBM's practice guidance indicates that loading whole document libraries into prompts can consume an order of magnitude more tokens than RAG retrieval, making large context significantly more expensive as the library grows.
- Large context windows suit small, stable document sets and tasks that require the model to read the full document, such as contract review or board pack summarisation.
- The ICO's 2024 AI and data protection guidance reinforces the importance of accuracy and traceability in AI use; RAG's ability to log which document sections were retrieved typically makes it easier to audit and explain.
- Before committing to either approach, answer five questions: how many documents do you have, how fast does the library change, do outputs need citations, what is the cost of a wrong answer, and can your team maintain document hygiene over time?

Your firm uses contracts, policies, or knowledge documents. An AI vendor is pitching you an assistant that can answer questions about them. Sounds straightforward. Then they ask which mode you want: a system that loads your entire document library into the model’s context every time, or one that searches and retrieves the relevant sections on demand. If you haven’t come across RAG or large context windows before, that question is nearly impossible to answer well. This guide covers the practical trade-offs without the architecture lecture.

What is the choice you’re actually facing?

RAG stands for retrieval-augmented generation. It finds the relevant sections from your document library first, then hands only those sections to the AI model. A large context window takes a different approach: the model reads a significant portion of your documents all at once, without a retrieval step. Both let AI answer questions about your business’s own documents. The difference is how that material gets in front of the model.
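To make the difference concrete, here is a minimal Python sketch rather than a production design. The three-section library, the section IDs, and the word-overlap scoring are all assumptions for illustration; real RAG systems typically use a search index or embeddings for the retrieval step.

```python
import re

# Hypothetical three-section library; a real system would load your own files.
LIBRARY = {
    "leave-policy#2": "Employees accrue 25 days of annual leave per year.",
    "expenses-policy#1": "Expenses claims must be submitted within 30 days.",
    "contract-acme#4": "Either party may terminate with 90 days written notice.",
}


def words(text: str) -> set[str]:
    """Lower-case a text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, library: dict[str, str], top_k: int = 1):
    """Rank sections by word overlap with the question (a stand-in for real search)."""
    q = words(question)
    ranked = sorted(
        library.items(),
        key=lambda item: len(q & words(item[1])),
        reverse=True,
    )
    return ranked[:top_k]


def rag_prompt(question: str, library: dict[str, str], top_k: int = 1) -> str:
    """RAG: only the retrieved sections are placed in front of the model."""
    sections = retrieve(question, library, top_k)
    context = "\n".join(f"[{sid}] {text}" for sid, text in sections)
    return f"{context}\n\nQuestion: {question}"


def full_context_prompt(question: str, library: dict[str, str]) -> str:
    """Large context: every section is placed in front of the model."""
    context = "\n".join(f"[{sid}] {text}" for sid, text in library.items())
    return f"{context}\n\nQuestion: {question}"
```

Asked "How many days of annual leave do employees get?", the RAG prompt in this sketch contains only the leave-policy section, while the full-context prompt contains all three sections regardless of the question.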

The debate around these two approaches has grown louder as context windows have expanded substantially. Google’s Gemini 1.5 supports windows of over one million tokens, which Dataiku converts to roughly 1,500 pages of text. That headline number makes RAG sound unnecessary. In practice, the effective context is often considerably lower than the advertised limit. IBM’s practice guidance notes that a model marketed at 128,000 tokens may have a working effective context of only 30,000 to 50,000 tokens for real-world tasks. That gap matters when you are deciding whether to pour your entire document library into every prompt. The right choice depends on the size and stability of your library, whether your outputs need traceable source references, and what an incorrect answer costs you.
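A rough back-of-envelope check, using only the figures quoted above, shows why that gap matters. The sketch assumes the Dataiku conversion (one million tokens to roughly 1,500 pages) holds for your documents, which will vary with formatting and language; the function names are hypothetical.

```python
# Assumption: the conversion quoted above (1,000,000 tokens ~ 1,500 pages).
TOKENS_PER_PAGE = 1_000_000 / 1_500


def estimated_tokens(pages: int) -> int:
    """Rough token estimate for a document of the given page count."""
    return round(pages * TOKENS_PER_PAGE)


def fits_effective_window(pages: int, effective_window_tokens: int) -> bool:
    """True if the estimate fits inside the working (not advertised) window."""
    return estimated_tokens(pages) <= effective_window_tokens


# IBM's quoted working range for a model marketed at 128,000 tokens.
LOW_EFFECTIVE, HIGH_EFFECTIVE = 30_000, 50_000

for pages in (20, 60, 200):
    print(
        pages,
        estimated_tokens(pages),
        fits_effective_window(pages, LOW_EFFECTIVE),
        fits_effective_window(pages, HIGH_EFFECTIVE),
    )
```

On these assumptions a 60-page document comes to about 40,000 tokens: inside the upper end of IBM's 30,000 to 50,000 range but outside the lower end, well before the advertised 128,000 limit comes into play.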

When does RAG make more sense for your business?

RAG is the better fit when your knowledge base is large, changes frequently, or covers a mixed library of policies, contracts, and internal guidance. It is also the right architecture when your team needs to point to the specific paragraph or source that informed an answer. Cohere describes RAG as the more controllable and scalable pattern where precision and source traceability matter.

The case for RAG strengthens the more client-facing or compliance-sensitive the output is. If an AI assistant gives a customer incorrect information because it drew from a poorly assembled context with no retrieval discipline, the consequences range from embarrassing to legally significant. The 2024 Air Canada chatbot ruling, where a Canadian tribunal held the airline responsible for misinformation from its own chatbot, illustrates that exposure clearly. RAG gives you a cleaner answer to “where did this come from?” because it logs which document sections were retrieved and surfaced to the model.
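As a minimal sketch of what that logging might look like: the field names and the audit_record helper below are hypothetical, and a real deployment would write to whatever logging or records system your firm already uses.

```python
import json
from datetime import datetime, timezone


def audit_record(question: str, retrieved_sections: list[tuple[str, str]], answer: str) -> dict:
    """Build one audit entry linking an answer to the sections that informed it."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        # Keep the section ID plus a short excerpt so a reviewer can trace the source.
        "sources": [
            {"section_id": section_id, "excerpt": text[:80]}
            for section_id, text in retrieved_sections
        ],
        "answer": answer,
    }


# Illustrative data only; the section ID and text are made up.
record = audit_record(
    question="What is the refund window?",
    retrieved_sections=[("refund-policy#3", "Refunds are available within 30 days of purchase.")],
    answer="Refunds are available within 30 days of purchase.",
)
print(json.dumps(record, indent=2))
```

A record like this is what lets you answer "where did this come from?" after the fact, provided the section IDs stay stable as the library changes.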

Token costs are a further consideration. IBM’s comparison suggests injecting an entire document set into every prompt can use an order of magnitude more tokens than a RAG approach. When you load the whole document library, the token count for every query scales with the library size rather than with the question’s complexity. A retrieval approach keeps the per-query cost roughly constant because only the relevant sections are passed in. For an owner-managed business watching API spend, a well-structured RAG setup typically becomes the more economical choice once the library grows beyond a few dozen files.
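The scaling argument can be sketched as simple arithmetic. Every number below (tokens per document, chunk size, chunks retrieved, question length) is an illustrative assumption, not a measurement; substitute your own figures.

```python
def full_context_tokens(num_docs: int, tokens_per_doc: int, question_tokens: int = 50) -> int:
    """Tokens per query when the whole library is loaded into every prompt."""
    return num_docs * tokens_per_doc + question_tokens


def rag_tokens(top_k: int, tokens_per_chunk: int, question_tokens: int = 50) -> int:
    """Tokens per query when only the top_k retrieved chunks are passed in."""
    return top_k * tokens_per_chunk + question_tokens


# Assumed: 4,000 tokens per document, 8 retrieved chunks of 500 tokens each.
for num_docs in (10, 50):
    full = full_context_tokens(num_docs, tokens_per_doc=4_000)
    rag = rag_tokens(top_k=8, tokens_per_chunk=500)
    print(f"{num_docs} docs: full context {full:,} tokens, RAG {rag:,} tokens")
```

With these illustrative numbers, a 10-document library costs 40,050 tokens per query under full context against 4,050 under RAG, roughly the order-of-magnitude gap IBM describes. At 50 documents the full-context figure rises to 200,050 while the RAG figure stays at 4,050, because it depends on how much is retrieved rather than on how large the library is.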

When do large context windows make more sense?

Large context windows suit situations where your document set is small and stable, and you want the model to read and reason across all of it at once. Summarising a board pack, reviewing a single contract end to end, or analysing a short policy document are tasks where loading the full text tends to produce sharper, more coherent output than chunked retrieval.

Unstructured’s analysis notes that long-context models can be a better fit when queries are repetitive and the document set is small, because you skip the complexity of retrieval pipelines, chunking strategies, and metadata tuning. An accountant reviewing a single client’s year-end accounts, a solicitor working through one property transaction pack, or an owner summarising a short report all fit this pattern.

That simplicity is the genuine advantage for small operations. You paste the file, ask your question, and the answer comes back without needing a retrieval pipeline, a metadata schema, or a chunking strategy in place. For a small team with limited technical capacity, this matters at the outset.

The caveat is that large context approaches do not scale well. Dataiku flags increased latency and processing costs as document volume grows. A model that handles one 60-page document well may struggle with fifty. If you expect your knowledge base to expand, retrofitting a retrieval layer later is harder than designing for it from the start.

What does it cost to get this wrong?

The cost of the wrong choice depends on your use case. Choose long-context for a growing document library and you will face token costs that scale badly and latency that degrades as files accumulate. Choose RAG without investing in document hygiene and metadata and the retrieval step underperforms, surfacing wrong sections or missing relevant material entirely, which is often worse than no system at all.

For owner-managed businesses in regulated sectors, the stakes are higher than for internal experimentation. The ICO’s 2024 guidance on AI and data protection stresses accuracy, transparency, and accountability in AI use, which means being able to explain where an AI-generated answer came from. If a client receives incorrect information because the model drew from a poorly constructed prompt with no retrieval discipline, the governance exposure sits with your business.

The Samsung incident in 2023 illustrated a related risk: employees pasted internal source code and meeting notes into ChatGPT, exposing sensitive business data. The architecture question sits alongside the data security question. Both RAG and large-context approaches can expose sensitive data if access controls, logging, and data handling discipline are not in place. The choice of architecture does not substitute for those controls.

What should you ask before you commit?

Five questions will clarify which architecture fits your situation. How many documents does your library contain? How often does it change? Do your AI outputs need to reference specific sources? What is the cost if an answer is wrong? Can your team maintain document hygiene and metadata over time? If you are serving EU clients or using EU-deployed AI tools, also ask what compliance obligations apply.

If your document library is small and relatively stable, and you need the model to read across whole documents in full rather than find specific answers within them, long-context is a reasonable starting point. If your knowledge base spans more than a few dozen documents, is growing, or covers areas where citations matter for compliance or client communication, RAG is usually the safer long-term choice.
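The rule of thumb above can be written down as a short sketch. The LibraryProfile fields, the suggest_architecture function, and the cut-off of 36 for "a few dozen" are all assumptions made for illustration; treat the output as a prompt for discussion, not a verdict.

```python
from dataclasses import dataclass

# Assumption: "a few dozen" is taken as 36 here; pick your own threshold.
FEW_DOZEN = 36


@dataclass
class LibraryProfile:
    num_documents: int
    growing: bool
    needs_citations: bool
    needs_whole_document_reasoning: bool


def suggest_architecture(profile: LibraryProfile) -> str:
    """Apply the rule of thumb: size, growth, or citation needs point to RAG."""
    if profile.num_documents > FEW_DOZEN or profile.growing or profile.needs_citations:
        return "RAG"
    if profile.needs_whole_document_reasoning:
        return "long-context"
    # The rule of thumb does not cover this case; go back to the questions.
    return "unclear"


print(suggest_architecture(LibraryProfile(5, False, False, True)))
print(suggest_architecture(LibraryProfile(120, True, True, False)))
```

The first profile (five stable documents read in full) lands on long-context and the second (a large, growing library with citation needs) lands on RAG. The sketch deliberately leaves out the cost of a wrong answer and your capacity for document hygiene, which need judgement rather than a threshold.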

The UK regulatory framing is consistent. The ICO’s guidance on AI and data protection emphasises data minimisation: feeding an entire document library into every prompt may raise questions about whether you are processing more data than the task requires. The EU AI Act, adopted in 2024, creates additional obligations around governance and transparency for UK firms with EU operations. The NCSC and the FCA have both published guidance on AI governance that points in the same direction. The architecture choice should be deliberate and documented, not made by default when the vendor asks which mode you want.

If you’re working through this decision for a specific knowledge base or use case, book a conversation and we can map the options to your actual setup.

Sources

- ICO (2024). AI and data protection. ICO guidance covering accuracy, transparency, data minimisation, and accountability obligations for UK organisations deploying AI systems. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/ai-and-data-protection/
- NCSC (2024). Guidance on securing artificial intelligence. Practical security guidance for UK organisations adopting AI, including supply-chain and data-handling considerations. https://www.ncsc.gov.uk/guidance/securing-artificial-intelligence
- FCA (2024). Artificial intelligence in financial services. FCA expectations around AI governance and operational resilience for regulated firms. https://www.fca.org.uk/firms/artificial-intelligence
- European Union (2024). EU AI Act (Regulation 2024/1689). Governance, transparency, and risk management obligations for AI systems, relevant to UK firms with EU operations or EU-deployed tools. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689
- Reuters (2023). Samsung bans generative AI use by staff after ChatGPT data leak. Reports on internal data exposure when employees pasted source code and meeting notes into ChatGPT. https://www.reuters.com/world/asia-pacific/samsung-bans-generative-ai-use-staff-after-chatgpt-data-leak-2023-05-02/
- British Columbia Civil Resolution Tribunal (2024). Moffatt v Air Canada, 2024 BCCT 74. Ruling holding Air Canada liable for misinformation provided by its AI chatbot in a customer-facing context. https://www.canlii.org/en/bc/bcct/doc/2024/2024bcct74/2024bcct74.html
- Dataiku (2024). Is RAG obsolete? Analysis of when RAG outperforms long-context models and vice versa, covering cost, latency, and citation trade-offs. https://www.dataiku.com/stories/blog/is-rag-obsolete
- Cohere (2024). RAG is here to stay. Explanation of why RAG remains a valuable source-of-truth pattern with better controllability and relevance targeting than direct context injection. https://cohere.com/blog/rag-is-here-to-stay
- Unstructured (2024). RAG vs long-context models: do we still need RAG? Analysis of when long-context may complement or replace RAG for different document and query scenarios. https://unstructured.io/blog/rag-vs-long-context-models-do-we-still-need-rag
- IBM (2024). RAG versus direct context: a practical comparison. Practice guidance comparing token costs, effective context limits, and retrieval trade-offs for enterprise document use cases. https://www.youtube.com/watch?v=MEOh5fdBWWs&vl=en-US

Frequently asked questions

What is the difference between RAG and a large context window?

RAG (retrieval-augmented generation) finds the relevant sections from your document library first, then feeds only those sections to the AI model. A large context window loads a significant portion of your documents directly into the model at once, without a retrieval step. RAG is generally better for large or changing knowledge bases where citations matter; large context suits small, stable document sets where reasoning across the whole file is the priority.

Is RAG more expensive than using a large context window?

It depends on the size of your document library. IBM's practice guidance indicates that loading an entire document set into every prompt can consume an order of magnitude more tokens than a RAG approach, which retrieves only the relevant sections. For small document sets, large context may cost less. As the library grows, RAG typically becomes the more economical architecture.

Do UK businesses have specific obligations when choosing between RAG and large context windows?

No regulation mandates one approach over the other, but the ICO's AI and data protection guidance stresses accuracy, transparency, and data minimisation. Both choices carry governance implications. RAG's ability to log which document sections informed a response tends to be easier to audit and explain to regulators. For firms with EU operations, the EU AI Act adds further governance and transparency requirements.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
