RAG or long context: how to choose the right architecture for your business AI

TL;DR

RAG retrieves relevant passages from an external knowledge base at the point of answering, keeping responses grounded in current content. Long context loads source material directly into the model's prompt window, which is simpler to prototype but can lose accuracy as the corpus grows. For owner-managed businesses, the choice turns on how often your knowledge changes, whether you need a clear audit trail, and what your data protection obligations require.

Key takeaways

- RAG retrieves knowledge from an external source at inference time, keeping answers current as the knowledge base is updated independently of the model.
- Long context loads source material directly into the model's prompt window, which is simpler to start but can degrade in accuracy as the corpus grows larger.
- Databricks benchmark testing found that some models begin to lose performance at 32,000 to 64,000 tokens, making large-corpus long context a reliability risk in production.
- Snowflake's research shows that retrieval and chunking strategy often matters more than raw model size for document question-and-answer accuracy.
- UK data protection law, FCA governance requirements, and NCSC security guidance all apply to either architecture when personal data or regulated content is involved.

You’ve probably seen this play out. Someone on the team connects an AI assistant to a folder of internal documents, and for a small set of contracts and policy notes it answers questions well enough. Then you try to scale it: add the full service manual, two years of client briefs, a compliance library. And the answers start to drift. Wrong dates. Missed clauses. Confident responses about procedures that were revised six months ago.

The architecture choice was made implicitly, by whoever set the tool up, rather than deliberately. There are two main approaches to giving an AI system access to your business knowledge, and which one you choose shapes answer quality, cost, audit trail, and compliance exposure in ways that often only become visible once the system is already in use.

What choice are you actually making here?

RAG and long context both give an AI model access to your knowledge, but the architecture differs. With RAG, the model retrieves relevant passages from an external knowledge base at the point of answering. With long context, you load source material directly into the model’s prompt window. Both approaches look similar in demos. Their behaviour when the corpus is large or the knowledge changes frequently differs considerably.

In a retrieval system, your documents are chunked, embedded, and stored in a vector database. When someone asks a question, the system finds the most relevant chunks and passes them to the model as context. The model answers from those retrieved passages and can, in principle, point back to the source document.
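The retrieval steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `embed` is a toy word-count stand-in for a real embedding model, a plain list stands in for the vector database, and the file names and passages are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(documents):
    """Turn {source name: [passages]} into (source, passage, vector) rows.

    The returned list plays the role of the vector database.
    """
    return [
        (source, passage, embed(passage))
        for source, passages in documents.items()
        for passage in passages
    ]


def retrieve(index, question, k=2):
    """Return the k passages most similar to the question, with their sources."""
    query = embed(question)
    ranked = sorted(index, key=lambda row: cosine(query, row[2]), reverse=True)
    return [(source, passage) for source, passage, _ in ranked[:k]]


# Hypothetical documents, already split into passages.
documents = {
    "refund-policy.txt": [
        "Refunds are issued within 14 days of a written request.",
        "Gift cards cannot be refunded.",
    ],
    "holiday-policy.txt": ["Staff accrue 25 days of annual leave per year."],
}

index = build_index(documents)
top = retrieve(index, "How many days until refunds are issued?", k=1)
for source, passage in top:
    print(f"{source}: {passage}")
```

Because each retrieved passage carries its source name, the same structure is what lets a retrieval system point back to the document an answer came from.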

Long context skips the retrieval step. You include the source documents, or a substantial portion of them, directly in the prompt. Models from Google and Anthropic now accept more than a million tokens in a single prompt, which is enough to hold entire manuals or contract packs in one go.

Neither approach is inherently better. AWS describes RAG as using “authoritative, pre-determined knowledge sources” outside the model’s training data, and that independence from the model is what gives retrieval its long-term maintenance advantage. But for a founder who needs something working quickly over a small, stable document set, loading context directly is often the faster starting point.

When does retrieval give you the stronger answer?

Retrieval is the stronger choice when your knowledge base changes frequently, when you need the system to identify which source each answer came from, or when you are handling regulated content. AWS notes that RAG lets organisations update their knowledge sources independently of the base model. That means revised policies and fresh pricing reach the system as soon as the knowledge base is re-indexed, with no change to the model itself. When information changes monthly, retrieval keeps answers current without retraining or redeploying the model.

AWS also notes that RAG can improve control over generated text because the system can point to the knowledge base the answer came from. For an owner-managed business, that matters more than it might appear. When a client questions a quoted figure, when a regulator asks how a decision was reached, or when a team member challenges an AI-produced briefing, being able to point to the precise clause or document is the difference between a defensible answer and an apology.

Snowflake’s research on financial document analysis found that retrieval and chunking strategy often matters more than raw model choice for answer accuracy. Its work also found that chunk size makes a material difference: chunks of around 1,800 characters, retrieved in quantity, outperformed very large chunks, which reduced accuracy by around 10 to 20 percent.
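To make the chunk-size point concrete, here is a small sketch of splitting a document into pieces of at most 1,800 characters. The 1,800 figure is the one cited above; the splitting rule itself (packing whole words until the limit is reached) is an assumption chosen for simplicity and is not Snowflake's method, and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text, max_chars=1800):
    """Pack whole words into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole as its own chunk,
    so that one chunk can exceed the limit.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_chars and current:
            # The current chunk is full: store it and start a new one.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Invented sample text: 1,000 repetitions of a short word.
sample = "word " * 1000
chunks = chunk_text(sample)
print(len(chunks), [len(chunk) for chunk in chunks])
```

In practice you would probably split on sentence or paragraph boundaries rather than bare words, so that a clause is not cut in half. The size limit is the part that corresponds to the finding above.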

For owner-managed businesses in regulated sectors, the ICO’s guidance makes the direction clear: organisations must be able to explain AI-driven processing and minimise data use under UK data protection law. Retrieval architectures, when built carefully, support that requirement more readily than unrestricted long-context pasting from a mixed-quality file store.

When does long context work in your favour?

Long context is at its best when your working set is small, stable, and already curated. Models from Google and Anthropic now accept over a million tokens in a single prompt, enough to hold an entire training manual or a due-diligence bundle in one go. Where documents do not change often and quick setup matters more than audit depth, the context window is the simpler choice.

Long context also has a practical role as an interim step. Before you build a full retrieval pipeline, with chunking, embedding, a vector database, permissions, and logging, you can prototype the question-answering behaviour by loading a curated set of documents into a prompt. That experiment often reveals what the knowledge base actually needs before you invest in the plumbing.
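A prototype of this kind can be as small as a function that assembles the curated documents into one prompt. The sketch below is illustrative only: the document names, the instruction wording, and the `build_long_context_prompt` helper are all assumptions, and the call that sends the prompt to a model is left out because it depends on the provider you are trialling.

```python
def build_long_context_prompt(documents, question):
    """Combine curated documents and a question into a single prompt string.

    documents maps a document name to its full text.
    """
    parts = [
        "Answer using only the documents below. Name the document you relied on."
    ]
    for name, text in documents.items():
        # Label each document so the model can say which one it used.
        parts.append(f"=== {name} ===\n{text.strip()}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


# Hypothetical curated working set.
curated = {
    "onboarding-guide.txt": "New starters receive a laptop on day one.",
    "expenses-policy.txt": "Claims must be submitted within 30 days.",
}

prompt = build_long_context_prompt(curated, "When must expense claims be submitted?")
print(prompt)
```

Running a handful of real questions through a prompt like this is often enough to show which documents are missing, duplicated, or out of date.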

The limit on long context is not the headline token count. Vellum notes that the million-token frontier is real, but loading more documents into a window does not mean the model pays equal attention to all of them. Research from Databricks shows performance can improve up to a point and then plateau or decline as context grows.

If your use case is creative drafting rather than factual retrieval, the distinction matters less. Where you need the model to draft, summarise, or brainstorm from a curated pack of materials, long context is often adequate and the retrieval architecture is more than the task requires.

What does it actually cost to get this wrong?

The more common failure mode is answer drift rather than outright error. Databricks found that long-context performance can plateau or decline as context grows, with some models degrading at 32,000 to 64,000 tokens. For an owner-managed business using AI over internal documents, that means answers that are confidently wrong rather than obviously broken. Those are harder to catch and more damaging when they reach a client.
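A rough way to see where a document set sits relative to that range is to estimate its token count before loading it. The sketch below assumes about four characters per token, a common rule of thumb for English text rather than a real tokenizer, and it uses the 32,000 to 64,000 range cited above. The function names and the "low", "watch", and "high" labels are hypothetical.

```python
CHARS_PER_TOKEN = 4  # assumption: rough heuristic for English, not a tokenizer
DEGRADATION_RANGE = (32_000, 64_000)  # token range cited above from Databricks


def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def context_risk(documents):
    """Return (estimated tokens, label) for a list of document texts."""
    total = sum(estimate_tokens(text) for text in documents)
    low, high = DEGRADATION_RANGE
    if total < low:
        return total, "low"
    if total <= high:
        return total, "watch"
    return total, "high"


# Invented example: three documents of 60,000 characters each.
tokens, label = context_risk(["x" * 60_000] * 3)
print(tokens, label)
```

The label is only a prompt to test more carefully. The range applies to some models rather than all, so the thresholds are a starting point for your own checks, not a guarantee.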

In business terms, that can mean incorrect quotations, wrong policy advice, or missed contractual obligations, not because the model broke but because it quietly deprioritised relevant passages in a crowded window. Databricks observed this pattern across several frontier models, not just older or smaller ones.

The regulatory cost is worth naming directly. The FCA, in its AI discussion paper, expects firms deploying AI in financial workflows to maintain explainability, accountability, and adequate governance. An architecture choice that makes it harder to identify which knowledge source produced an answer is a governance risk as well as an accuracy risk.

The NCSC’s guidance adds a security dimension: it treats prompt injection, supply chain risk, and data poisoning as design concerns rather than afterthoughts, and pushes many firms towards tighter source control rather than unrestricted context loading from mixed-quality files.

A poor architecture also compounds data hygiene problems. Retrieval and long context will both be unreliable if the source documents are inconsistent, out of date, or poorly structured. The bottleneck is often source quality, not the approach you chose.

What to ask before you commit

A handful of questions will narrow this decision quickly, and the key ones concern the knowledge itself, not the model. How often does the underlying content change? Do you need the system to identify which source each answer came from? Are you handling personal data or anything under UK data protection law? For many owner-managed businesses, those three questions settle the architecture choice before any vendor is involved.

If the knowledge changes frequently, needs to be traceable to its source document, or involves personal or regulated data, retrieval is usually the right foundation, even if it takes longer to build.

If the working set is genuinely small and stable, ask whether you need deterministic retrieval of specific passages or whether a general answer will serve. Snowflake’s findings suggest that when precision matters, retrieval quality and chunking strategy count for more than model size. When good-enough answers are acceptable and the working set is curated, long context may serve adequately.

Consider your logging and access requirements. The ICO, the FCA, and the NCSC all expect organisations handling personal or regulated data to apply proportionate controls around AI inputs and outputs. Loading mixed-quality files into a context window without access controls or redaction carries risk under either architecture, but long context can make data sprawl harder to contain.
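As one illustration of what a logging control might capture, the sketch below writes a single audit record per answer, including the sources the answer drew on. The field names and the `audit_record` helper are hypothetical and are not prescribed by the ICO, the FCA, or the NCSC. What you are actually required to record depends on your own obligations.

```python
import json
from datetime import datetime, timezone


def audit_record(user, question, sources, answer):
    """Build one JSON line describing who asked what, and what the answer used."""
    return json.dumps(
        {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "question": question,
            "sources": sorted(sources),
            "answer": answer,
        }
    )


# Invented example values.
line = audit_record(
    user="j.smith",
    question="What is the refund window?",
    sources=["refund-policy.txt"],
    answer="Refunds are issued within 14 days of a written request.",
)
print(line)
```

Appending lines like this to a file or log store gives you something concrete to review when an answer is questioned, under either architecture.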

If the EU AI Act is relevant to your business because you serve EU customers or operate AI in a regulated workflow, build logging, transparency, and human oversight into the architecture from the start, not as a retrofit.

When the questions still leave you undecided, prototype with long context first. That experiment will surface what your knowledge base needs before you invest in retrieval infrastructure.
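The questions in this section can be written down as a short decision helper. It is a sketch of the reasoning above rather than a rule: the function name, parameter names, and return strings are hypothetical, and a real decision would weigh cost and team capacity as well.

```python
def suggest_architecture(
    changes_frequently,
    needs_source_tracing,
    handles_regulated_data,
    small_stable_corpus,
):
    """Map the four yes/no answers to a suggested starting point."""
    if changes_frequently or needs_source_tracing or handles_regulated_data:
        # Any one of the first three questions points towards retrieval.
        return "retrieval"
    if small_stable_corpus:
        return "long context"
    # Still undecided: prototype first, then revisit.
    return "prototype with long context"


# Invented example: a firm with monthly policy updates and client data.
print(
    suggest_architecture(
        changes_frequently=True,
        needs_source_tracing=False,
        handles_regulated_data=True,
        small_stable_corpus=True,
    )
)
```

Note that a small, stable corpus does not override the first three questions: if any of them is answered yes, the helper still suggests retrieval.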

Sources

- AWS (2024). What is RAG? Explains retrieval-augmented generation as using external authoritative knowledge sources independently of the base model. https://aws.amazon.com/what-is/retrieval-augmented-generation/
- Vellum (2024). RAG vs Long Context. Covers practical differences between retrieval and long-context approaches, including 1M+ token model capabilities from Google and Anthropic. https://www.vellum.ai/blog/rag-vs-long-context
- Databricks (2024). Long Context RAG Performance of LLMs. Benchmark study showing long-context performance can decline at 32k to 64k tokens for several frontier models. https://www.databricks.com/blog/long-context-rag-performance-llms
- Snowflake (2024). Long-Context Isn't All You Need: How Retrieval and Chunking Impact Financial RAG. Engineering case study showing chunking strategy can outperform raw model capacity; approximately 1,800-character chunks outperformed very large chunks by 10 to 20 percent. https://www.snowflake.com/en/blog/engineering/impact-retrieval-chunking-finance-rag/
- ICO (2023-2024). AI and data protection guidance hub. ICO guidance on explaining AI-driven processing and data minimisation under UK GDPR. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/
- ICO (2023-2024). Guidance on AI and data protection. Sets out organisations' duties to explain AI processing and apply appropriate governance when personal data is used. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/
- FCA (2024). Artificial Intelligence discussion paper DP5/24. Sets out FCA expectations on explainability, accountability, governance, and data quality for AI in financial services. https://www.fca.org.uk/publication/discussion/dp5-24.pdf
- NCSC (2024). Using AI securely in your organisation. Covers prompt injection, supply chain risk, data poisoning, and proportionate controls around AI data inputs and outputs. https://www.ncsc.gov.uk/collection/artificial-intelligence/using-ai-securely-in-your-organisation
- EUR-Lex (2024). Regulation (EU) 2024/1689 (AI Act). Introduces obligations for high-risk AI systems including data governance, logging, transparency, and human oversight duties. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

Frequently asked questions

What is the difference between RAG and long context?

RAG retrieves relevant passages from an external knowledge base at the point of answering, so the model works from what it found rather than everything loaded in advance. Long context loads source documents directly into the model's prompt window before the question is asked. RAG handles large, frequently updated corpora better. Long context suits small, stable document sets where quick setup matters more than freshness or audit depth.

When should a business use RAG instead of long context?

Use retrieval when your knowledge changes frequently, when you need the system to identify which source each answer came from, or when you are handling personal data or regulated content under UK data protection or financial services rules. Snowflake's research found that chunking and retrieval strategy often matters more than model size for document accuracy, and AWS notes that RAG lets you update knowledge independently of the underlying model.

What happens if you feed too many documents into a long context window?

Performance can plateau and then decline. Databricks found that some models degrade at around 32,000 to 64,000 tokens in the context window. In practice, answers can become confidently wrong rather than obviously broken, which is harder to detect. For an owner-managed business using AI over internal documents, that kind of drift can produce incorrect policy advice or missed contract obligations before anyone notices.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

