RAG or long context: choosing the right AI architecture

You’ve probably seen this play out. Someone on the team connects an AI assistant to a folder of internal documents, and for a small set of contracts and policy notes it answers questions well enough. Then you try to scale it: add the full service manual, two years of client briefs, a compliance library. And the answers start to drift. Wrong dates. Missed clauses. Confident responses about procedures that were revised six months ago.

The architecture choice was made implicitly, by whoever set the tool up, rather than deliberately. There are two main approaches to giving an AI system access to your business knowledge, and which one you choose shapes answer quality, cost, audit trail, and compliance exposure in ways that often only become visible once the system is already in use.

What choice are you actually making here?

RAG (retrieval-augmented generation) and long context both give an AI model access to your knowledge, but they do it differently. With RAG, the model retrieves relevant passages from an external knowledge base at the point of answering. With long context, you load source material directly into the model’s prompt window. The two look similar in demos, but they behave very differently once the corpus is large or the knowledge changes frequently.

In a retrieval system, your documents are chunked, embedded, and stored in a vector database. When someone asks a question, the system finds the most relevant chunks and passes them to the model as context. The model answers from those retrieved passages and can, in principle, point back to the source document.
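The pipeline described above can be sketched in a few lines. This is a toy illustration, not a production design: a word-count vector stands in for a real embedding model, a Python list stands in for the vector database, and the file names and policy text are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(documents):
    """Chunk each document by paragraph and store every chunk with its source."""
    index = []
    for source, text in documents.items():
        for chunk in text.split("\n\n"):
            chunk = chunk.strip()
            if chunk:
                index.append({"source": source, "text": chunk, "vector": embed(chunk)})
    return index


def retrieve(index, question, k=2):
    """Return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda c: cosine(query, c["vector"]), reverse=True)
    return ranked[:k]


# Invented sample documents, for illustration only.
docs = {
    "leave-policy.txt": "Annual leave is 25 days per year.\n\nLeave requests need manager approval.",
    "expenses.txt": "Expenses must be submitted within 30 days of purchase.",
}
index = build_index(docs)
hits = retrieve(index, "How many days of annual leave do we get?", k=1)
```

Each hit carries its `source` alongside the passage text, which is what lets the system point back to the originating document.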

Long context skips the retrieval step. You include the source documents, or a substantial portion of them, directly in the prompt. Models from Google and Anthropic now accept more than a million tokens in a single prompt, which is enough to hold entire manuals or contract packs in one go.
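By contrast, a long-context setup can be as simple as concatenating documents into one prompt. The sketch below assumes a rough rule of thumb of about four characters per token for English text; real tokenisers vary, so treat the estimate as approximate. The function names and the delimiter format are illustrative choices, not any vendor's API.

```python
def estimate_tokens(text, chars_per_token=4):
    """Crude token estimate. Assumption: roughly four characters per token."""
    return len(text) // chars_per_token + 1


def build_long_context_prompt(documents, question, window_tokens=1_000_000):
    """Concatenate whole documents into one prompt and check it fits the window."""
    parts = [f"=== {name} ===\n{text}" for name, text in documents.items()]
    prompt = "\n\n".join(parts) + f"\n\nQuestion: {question}"
    used = estimate_tokens(prompt)
    if used > window_tokens:
        raise ValueError(
            f"Estimated {used} tokens exceeds the {window_tokens}-token window"
        )
    return prompt, used


# Invented sample content, for illustration only.
prompt, used = build_long_context_prompt(
    {"service-manual.txt": "Replace the filter every six months."},
    "How often should the filter be replaced?",
)
```

There is no index to build or maintain, which is why this route is quick to set up. Every question resends the full document set, though, so the prompt grows with the corpus.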

Neither approach is inherently better. AWS describes RAG as using “authoritative, pre-determined knowledge sources” outside the model’s training data, and that independence from the model is what gives retrieval its long-term maintenance advantage. But for a founder who needs something working quickly over a small, stable document set, loading context directly is often the faster starting point.

When does retrieval give you the stronger answer?

Retrieval is the stronger choice when your knowledge base changes frequently, when you need the system to identify which source each answer came from, or when you are handling regulated content. AWS notes that RAG lets organisations update their knowledge sources independently of the base model. That means revised policies and fresh pricing reach the system as soon as the knowledge base is updated, with no change to the model itself. When information changes monthly, retrieval keeps answers current without retraining or replacing anything.
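To make the "update the knowledge, leave the model alone" point concrete, here is a minimal sketch of a versioned document store. The `KnowledgeBase` class, its keyword search, and the pricing text are all hypothetical. A real system would re-chunk and re-embed the document on each update, but the shape of the operation is the same.

```python
class KnowledgeBase:
    """In-memory stand-in for a document store with per-document versions."""

    def __init__(self):
        self._docs = {}

    def upsert(self, doc_id, text):
        """Add a document, or replace the existing one, and bump its version."""
        previous = self._docs.get(doc_id)
        version = previous["version"] + 1 if previous else 1
        self._docs[doc_id] = {"version": version, "text": text}
        return version

    def search(self, keyword):
        """Return (doc_id, version, text) for documents containing the keyword."""
        keyword = keyword.lower()
        return [
            (doc_id, doc["version"], doc["text"])
            for doc_id, doc in self._docs.items()
            if keyword in doc["text"].lower()
        ]


kb = KnowledgeBase()
kb.upsert("pricing", "Standard plan: 40 GBP per month.")  # invented figures
kb.upsert("pricing", "Standard plan: 45 GBP per month.")  # the revision replaces the old text
results = kb.search("standard plan")
```

After the second `upsert`, only the revised text is searchable. The model was never touched, and the superseded figure can no longer be retrieved.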

AWS also notes that RAG can improve control over generated text because the system can point to the knowledge base the answer came from. For an owner-managed business, that traceability is not a nice-to-have. When a client questions a quoted figure, when a regulator asks how a decision was reached, or when a team member challenges an AI-produced briefing, being able to point to the precise clause or document is the difference between a defensible answer and an apology.
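One common way to get that traceability is to number the retrieved passages before they go into the prompt and keep a lookup from each number back to its source. The sketch below assumes each passage is a dict with `source`, `location`, and `text` keys; that shape, and the sample contract text, are hypothetical.

```python
def format_context(passages):
    """Number each passage and record where it came from.

    Returns the context text for the prompt and a citation lookup.
    """
    lines = []
    citations = {}
    for number, passage in enumerate(passages, start=1):
        label = f"[{number}]"
        citations[label] = f"{passage['source']}, {passage['location']}"
        lines.append(f"{label} {passage['text']}")
    return "\n".join(lines), citations


# Invented sample passages, for illustration only.
context, citations = format_context([
    {"source": "supply-contract.pdf", "location": "clause 7.2",
     "text": "Either party may terminate with 90 days' notice."},
    {"source": "supply-contract.pdf", "location": "clause 9.1",
     "text": "Invoices are payable within 30 days."},
])
```

If the model is asked to cite the bracketed labels in its answer, each label can be resolved to a document and clause through `citations`.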

Snowflake’s research on financial document analysis found that retrieval and chunking strategy often matters more than raw model choice for answer accuracy. Its work also found that chunk size makes a material difference: around 1,800-character chunks retrieved in quantity outperformed very large chunks, which reduced accuracy by around 10 to 20 percent.
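As a rough illustration of what chunk size means in practice, the sketch below splits text into pieces of at most 1,800 characters. The 200-character overlap between neighbouring chunks is an assumption added here so that a sentence cut at a boundary still appears whole in one chunk; it is not part of the Snowflake finding.

```python
def chunk_text(text, size=1800, overlap=200):
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters (an illustrative default).
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk reached the end of the text
    return chunks


sample = "x" * 4000  # placeholder text standing in for a real document
chunks = chunk_text(sample)
```

Splitting on fixed character counts is the simplest option. Splitting on paragraph or section boundaries usually keeps related sentences together, at the cost of uneven chunk sizes.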

For owner-managed businesses in regulated sectors, the ICO’s guidance makes the direction clear: organisations must be able to explain AI-driven processing and minimise data use under UK data protection law. Retrieval architectures, when built carefully, support that requirement more readily than unrestricted long-context pasting from a mixed-quality file store.

When does long context work in your favour?

Long context is at its best when your working set is small, stable, and already curated. A million-token window is enough to hold an entire training manual or a due-diligence bundle in one go. Where documents do not change often and quick setup matters more than audit depth, the context window is the simpler choice.

Long context also has a practical role as an interim step. Before you build a full retrieval pipeline, with chunking, embedding, a vector database, permissions, and logging, you can prototype the question-answering behaviour by loading a curated set of documents into a prompt. That experiment often reveals what the knowledge base actually needs before you invest in the plumbing.

The limit on long context is not the headline token count. Vellum notes that the million-token frontier is real, but loading more documents into a window does not mean the model pays equal attention to all of them. Research from Databricks shows performance can improve up to a point and then plateau or decline as context grows.

If your use case is creative drafting rather than factual retrieval, the distinction matters less. Where you need the model to draft, summarise, or brainstorm from a curated pack of materials, long context is often adequate and the retrieval architecture is more than the task requires.

What does it actually cost to get this wrong?

The more common failure mode is answer drift rather than outright error. Databricks found that long-context performance can plateau or decline as context grows, with some models degrading at 32,000 to 64,000 tokens. For an owner-managed business using AI over internal documents, that means answers that are confidently wrong rather than obviously broken. Those are harder to catch and more damaging when they reach a client.

In business terms, that can mean incorrect quotations, wrong policy advice, or missed contractual obligations, not because the model broke but because it deprioritised relevant passages in a crowded window. Databricks observed this pattern across several frontier models, not just older or smaller ones.
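A simple guard can make the gap between "fits in the window" and "likely to be read reliably" visible before a prompt is sent. The sketch below uses the lower end of the 32,000 to 64,000-token range cited above as a caution threshold. That threshold and the one-million-token limit are illustrative defaults, not fixed properties of any model.

```python
def context_risk(estimated_tokens, caution_from=32_000, headline_limit=1_000_000):
    """Classify a prompt size against a caution threshold and the window limit."""
    if estimated_tokens > headline_limit:
        return "too large"
    if estimated_tokens >= caution_from:
        return "fits, but test accuracy"
    return "ok"


for size in (5_000, 48_000, 1_200_000):
    print(size, context_risk(size))
```

A prompt in the middle band is not necessarily a problem, but it is the range where it pays to check answers against known facts before relying on them.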

The regulatory cost is worth naming directly. The FCA, in its AI discussion paper, expects firms deploying AI in financial workflows to maintain explainability, accountability, and adequate governance. An architecture choice that makes it harder to identify which knowledge source produced an answer is a governance risk as well as an accuracy risk.

The NCSC’s guidance adds a security dimension: it treats prompt injection, supply chain risk, and data poisoning as design concerns rather than afterthoughts, and pushes many firms towards tighter source control rather than unrestricted context loading from mixed-quality files.

A poor architecture also compounds data hygiene problems. Retrieval and long context will both be unreliable if the source documents are inconsistent, out of date, or poorly structured. The bottleneck is often source quality, not the approach you chose.

What to ask before you commit

A handful of questions will narrow this decision quickly, and the key ones concern the knowledge itself, not the model. How often does the underlying content change? Do you need the system to identify which source each answer came from? Are you handling personal data or anything under UK data protection law? For many owner-managed businesses, those three questions settle the architecture choice before any vendor is involved.

If the knowledge changes frequently, needs to be traceable to its source document, or involves personal or regulated data, retrieval is usually the right foundation, even if it takes longer to build.

If the working set is genuinely small and stable, ask whether you need deterministic retrieval of specific passages or whether a general answer will serve. Snowflake’s findings suggest that when precision matters, retrieval quality and chunking strategy count for more than model size. When good-enough answers are acceptable and the working set is curated, long context may serve adequately.

Consider your logging and access requirements. The ICO, the FCA, and the NCSC all expect organisations handling personal or regulated data to apply proportionate controls around AI inputs and outputs. Feeding mixed-quality files to a model without access controls or redaction carries risk under either architecture, but long context can make data sprawl harder to contain.
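As one small example of a proportionate control, the sketch below writes an audit record for each question and masks email addresses before the text is stored. The record fields and the single regular expression are assumptions for illustration. Real redaction needs to cover far more than email addresses, and none of this is drawn from the regulators' guidance.

```python
import json
import re
from datetime import datetime, timezone

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text):
    """Mask email addresses before text is logged."""
    return EMAIL.sub("[redacted email]", text)


def audit_record(user, question, sources):
    """Build one JSON log line: who asked what, and which sources were used."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "question": redact(question),
        "sources": sorted(sources),
    })


line = audit_record(
    "user-17",  # hypothetical user identifier
    "Email jane.doe@example.com about the lease renewal",
    ["lease-2023.pdf"],
)
```

Recording the sources used for each answer is straightforward with retrieval, because the system already knows which passages it fetched. With long context, the honest log entry is the full list of files loaded into the prompt.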

If the EU AI Act is relevant to your business because you serve EU customers or operate AI in a regulated workflow, build logging, transparency, and human oversight into the architecture from the start, not as a retrofit.

When the questions still leave you undecided, prototype with long context first. That experiment will surface what your knowledge base needs before you invest in retrieval infrastructure.
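The questions in this section can be written down as a small decision helper. This is a simplification of the reasoning above, not a substitute for it, and the function and parameter names are invented for the example.

```python
def recommend_architecture(
    changes_often,
    needs_source_tracing,
    regulated_or_personal_data,
    small_stable_set=False,
    needs_exact_passages=False,
):
    """Turn the checklist questions into a starting recommendation."""
    # Any of the three core questions answered "yes" points to retrieval.
    if changes_often or needs_source_tracing or regulated_or_personal_data:
        return "retrieval"
    # A small, stable set: it depends on how much precision matters.
    if small_stable_set:
        return "retrieval" if needs_exact_passages else "long context"
    # Still undecided: prototype first and learn from it.
    return "prototype with long context"


print(recommend_architecture(True, False, False))
print(recommend_architecture(False, False, False, small_stable_set=True))
print(recommend_architecture(False, False, False))
```

The ordering is deliberate: the three questions about the knowledge itself are checked first, because a "yes" to any of them outweighs the convenience of a quick setup.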
