RAG vs fine-tuning: which one your firm actually needs

TL;DR

For the typical UK SME, the right starting point is RAG, retrieval-augmented generation, not fine-tuning. RAG suits dynamic knowledge, source citations and rapid iteration, with a cost shape of roughly £50 to £500 a month for a vector database plus standard inference. Fine-tuning earns its place when format consistency, very high query volume or sub-200ms latency demands it. The 2026 production pattern is hybrid: RAG for knowledge, fine-tuning for format.

Key takeaways

- RAG specialises a model on knowledge at query time. Fine-tuning specialises a model on format and behaviour at training time.
- For the typical SME, RAG is the right first move. It is cheaper, faster to ship and easier to audit.
- Fine-tuning earns its keep on stable domains, very high query volume, sub-200ms latency, or strict output format.
- Source citations, GDPR right-to-be-forgotten and EU AI Act Article 11 documentation all favour RAG over fine-tuned weights.
- The 2026 default for serious deployments is hybrid: RAG for the knowledge, a thin LoRA fine-tune for the format on top.

A 30-staff specialist construction consultancy showed me a vendor quote last month. Forty-five thousand pounds, four months, to fine-tune a Llama 70B on the firm’s project library. The technical lead asked one question. “How often is the library updated?” Weekly, came the answer. The quote was for the wrong architecture.

The right architecture, for that firm and almost every SME like it, was RAG, retrieval-augmented generation, with a thin LoRA fine-tune for format on top. Six weeks instead of four months. Ten thousand pounds total. Engineers could cite the exact source paragraph in every report. The opening question was not "RAG or fine-tuning?". It was "is the thing you want to specialise the model on knowledge, format, or both?"

The choice you’re facing

RAG, introduced by Lewis and colleagues in 2020, keeps the foundation model fixed and serves it your knowledge at query time from a vector database. You chunk the firm’s documents, embed them, store them in Pinecone, Weaviate or AWS Bedrock Knowledge Bases, then retrieve the relevant chunks and pass them as context. The model itself is unchanged. What changes is what it is reading.
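To make that loop concrete, here is a minimal sketch in Python. It assumes the OpenAI SDK and an in-memory index purely for illustration; a production build would swap in Pinecone, Weaviate or a managed service, but the shape is identical.

```python
# Minimal RAG loop: chunk, embed, retrieve, answer with context.
# Sketch only; assumes the OpenAI Python SDK and illustrative model names.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Chunk and embed the firm's documents once, at indexing time.
chunks = ["Clause 4.2: retention payments are released at 12 months...",
          "Project Alpha used CEM III concrete for the raft foundation..."]
index = embed(chunks)

# 2. At query time, retrieve the most relevant chunks by cosine similarity.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed([query])[0]
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# 3. Pass the retrieved chunks as context; the model itself is unchanged.
def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system",
                   "content": "Answer only from the context below and cite the chunk you used.\n\n" + context},
                  {"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

print(answer("When are retention payments released?"))
```

Everything interesting happens before the model is called: chunking, embedding and retrieval are where RAG projects are won or lost.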

Fine-tuning adjusts the model’s weights, baking patterns into the model itself using a labelled training set. Modern parameter-efficient methods like LoRA (Hu et al. 2022) update only a small fraction of weights, so a fine-tune on a 7B model can ship for a few thousand pounds rather than tens of thousands. RAG specialises the model on knowledge at query time. Fine-tuning specialises the model on format, behaviour and reasoning patterns at training time.
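For a sense of how thin the trainable layer is, here is a LoRA configuration sketch using the Hugging Face PEFT library. The base model name and hyperparameters are illustrative, not a recommendation.

```python
# LoRA: freeze the base model, train small low-rank adapters on top.
# Sketch only; assumes Hugging Face transformers + peft and an illustrative 7B base.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # illustrative base model
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # only the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Typically well under 1% of the weights are trainable, which is why a format
# fine-tune on a 7B model can ship for thousands rather than tens of thousands.
```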

That distinction is the whole game. Many firms framing this as “RAG or fine-tuning” are conflating two problems, and many vendor pitches lead with fine-tuning because the engagement is bigger. Sit with the diagnostic before you sit with the quote.

When RAG is the right answer

RAG is the right answer when knowledge changes faster than your retraining cycle, when source attribution matters, when you serve multiple domains with different document sets, or when your team needs to iterate quickly without an ML engineer in the loop. For UK SMEs in 2026, this covers the majority of first specialisation projects. The cost shape is predictable, the toolchain is mature and the compliance picture is cleaner.

The dynamic-knowledge case is the strongest. A regulated firm whose document library updates weekly cannot sustain a weekly fine-tune. Each retraining pass costs thousands and risks degrading capability on adjacent tasks. With RAG, the team updates the document and the next query reflects it.

The audit-trail case is the second strongest. The ICO’s guidance on explaining AI decisions sets the boundary clearly: a UK firm using AI in a regulated context must be able to show the information and processes behind every output. RAG returns the specific chunks used to ground a response, so the citation is part of the architecture. A fine-tuned model encodes the source across billions of weights, so “where did this answer come from” has no clean answer. For FCA-supervised firms, healthcare providers and legal practices, that gap is decisive.

The 2026 RAG ecosystem has done the heavy lifting. AWS Bedrock Knowledge Bases, Azure OpenAI On Your Data and OpenAI’s Assistants File Search offer hosted RAG where you upload documents and start querying. LangChain, LlamaIndex and Haystack give more control if you want it. Cost shape sits around £50 to £500 a month for a vector database, plus embedding and inference at standard rates. A typical SME pilot is production-ready in six to eight weeks.
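On the hosted route, the query side collapses to a single API call. Here is a sketch against Bedrock Knowledge Bases via boto3, with placeholder IDs; check the AWS documentation in the sources for the exact response shape.

```python
# Query a hosted RAG stack (AWS Bedrock Knowledge Bases) with one call.
# Sketch only; the knowledge base ID and model ARN are placeholders.
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="eu-west-2")

response = client.retrieve_and_generate(
    input={"text": "What retention terms apply on Project Alpha?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",  # placeholder
            "modelArn": "arn:aws:bedrock:eu-west-2::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)

print(response["output"]["text"])
# Each citation carries the retrieved source chunk, which is the audit trail.
for citation in response.get("citations", []):
    for ref in citation.get("retrievedReferences", []):
        print(ref["content"]["text"][:120])
```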

When fine-tuning is the right answer

Fine-tuning is the right answer when the problem is format or behaviour rather than knowledge, when query volume is very high and per-query economics dominate, when latency rules out a retrieval round-trip, or when the domain is narrow and stable enough to justify training. These are real cases. They are rarer than vendors imply, and they almost never apply on the first specialisation pass.

The format case is the cleanest. A model that must always respond in a specific schema, with a particular tone, applying a fixed taxonomy to thousands of inputs, benefits from fine-tuning in a way that prompt engineering rarely matches. A fine-tuned smaller model, trained on a few thousand annotated examples, classifies or extracts more reliably than a larger general model with retrieved context, at a fraction of the per-query cost.
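What that training set looks like in practice: a minimal sketch in the chat-format JSONL the OpenAI fine-tuning API expects, with an invented taxonomy standing in for yours.

```python
# Build a fine-tuning set for a fixed-schema classification task.
# Each example pairs a raw input with the exact output format you want baked in.
import json

examples = [
    ("Leak reported in plant room, water ingress at slab joint.",
     {"category": "defect", "discipline": "civil", "severity": "high"}),
    ("Client asked for updated programme by Friday.",
     {"category": "action", "discipline": "pm", "severity": "low"}),
]

with open("format_train.jsonl", "w") as f:
    for text, label in examples:
        record = {"messages": [
            {"role": "system", "content": "Classify the site note. Reply with JSON only."},
            {"role": "user", "content": text},
            {"role": "assistant", "content": json.dumps(label)},
        ]}
        f.write(json.dumps(record) + "\n")
# A few thousand rows like this, then one fine-tuning job: the schema lives in
# the weights rather than in a long prompt on every call.
```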

Latency is the second case. A RAG call adds 50 to 300ms of retrieval latency on top of inference. For real-time interactive use where the user feels every extra millisecond, removing that round-trip decides whether the interface feels conversational or sluggish.

The third case is unit economics. Above roughly 10 million queries a month on a narrow task, a fine-tuned 7B running on a couple of GPUs can outperform a general 70B via API at a tenth of the per-query cost. The 2026 toolchain (OpenAI fine-tuning API, AWS Bedrock fine-tuning for Claude Haiku, Hugging Face Transformers, Modal and RunPod) makes a LoRA fine-tune shippable for £500 to £5,000 in compute. Below that volume threshold, the maths usually favours RAG.
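A back-of-envelope version of that threshold, with every unit price an assumption to replace with your own quotes.

```python
# Back-of-envelope: self-hosted fine-tuned 7B vs large-model API at volume.
# Every price here is an illustrative assumption; plug in your own numbers.
queries_per_month = 10_000_000
tokens_per_query = 1_500                       # prompt + completion, assumed

api_cost_per_1k_tokens = 0.002                 # assumed blended API rate (£)
api_monthly = queries_per_month * tokens_per_query / 1_000 * api_cost_per_1k_tokens

gpu_hourly = 1.50                              # assumed cost per GPU hour (£)
gpus = 2
self_hosted_monthly = gpus * gpu_hourly * 24 * 30   # fixed, regardless of volume

print(f"API:         £{api_monthly:,.0f}/month")
print(f"Self-hosted: £{self_hosted_monthly:,.0f}/month")
# The API bill scales linearly with queries; the GPUs do not. Where the lines
# cross is what decides whether fine-tuning earns its keep on cost alone.
```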

What it costs to get wrong

Both approaches have well-documented failure modes, and they look opposite to each other. The RAG failure is hallucinated grounding: retrieval misses the right chunk and the model invents a plausible answer with a fabricated citation. The fine-tuning failure is baked-in error: training-data mistakes encoded in the weights and applied consistently. Each fails quietly enough that teams miss it until a regulator or a customer notices.

Stanford’s 2024 study of legal AI tools found commercial systems claiming “hallucination elimination” still hallucinated 17 to 33 per cent of the time, often with citations to non-existent cases. The mitigation is mechanical: precision-at-K and recall-at-K metrics, chunking tuned to the document type, hybrid keyword and semantic retrieval, reranking before the model sees the chunks, and programmatic verification of citations against retrieved content. Treat these as structural requirements, not polish items to add later.
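Precision-at-K and recall-at-K are simple enough to stand up in an afternoon. A sketch over a small hand-labelled evaluation set; the retriever passed in is a placeholder for whatever your stack exposes.

```python
# Retrieval evaluation: before worrying about answer quality, measure whether
# the right chunks come back at all. Pure Python, no framework assumed.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for c in retrieved[:k] if c in relevant) / max(len(relevant), 1)

# Hand-labelled evaluation set: query -> chunk IDs a human says should be found.
eval_set = {
    "When are retention payments released?": {"doc7-chunk3"},
    "Which concrete mix did Project Alpha use?": {"doc2-chunk1", "doc2-chunk2"},
}

def run_eval(retriever, k: int = 5) -> None:
    # `retriever` is whatever your stack exposes: query in, ranked chunk IDs out.
    for query, relevant in eval_set.items():
        retrieved = retriever(query)
        print(f"{query}  P@{k}={precision_at_k(retrieved, relevant, k):.2f}"
              f"  R@{k}={recall_at_k(retrieved, relevant, k):.2f}")

# Stub retriever so the sketch runs; replace it with your real retrieval call.
run_eval(lambda q: ["doc7-chunk3", "doc1-chunk9", "doc2-chunk1",
                    "doc4-chunk2", "doc9-chunk5"])
```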

Fine-tuning fails differently. Catastrophic forgetting is the second-order failure: a fine-tune for new tasks can erode earlier capability, with a 2024 study finding 15 to 23 per cent of attention heads severely disrupted by sequential fine-tuning. The compliance picture is sharper still. UK GDPR’s right-to-be-forgotten is straightforward in RAG (delete the document, the request is honoured) but harder for fine-tuned weights, where a deletion request implies retraining. EU AI Act Article 11 documentation requirements bite both, but harder on fine-tuned models because of training-data provenance. Cost drift is the quiet failure on both sides. Budget for the right shape, not the headline number.
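To make the erasure point concrete, honouring a right-to-be-forgotten request in a RAG stack is one index operation. This sketch uses a hypothetical vector-store client; the method and field names are placeholders for whatever your store actually exposes.

```python
# UK GDPR erasure request under RAG: remove the source document's chunks from
# the index and the next query simply no longer sees them.
# Sketch against a hypothetical vector-store client; names are placeholders.
def handle_erasure_request(store, document_id: str) -> None:
    # Delete every embedded chunk whose metadata points at the source document.
    store.delete(filter={"source_document_id": document_id})
    print(f"Erasure honoured for {document_id}")  # replace with a proper audit log

# The fine-tuned equivalent has no such operation: the document's influence is
# spread across the weights, so erasure implies retraining.
```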

What to ask before you decide

Six questions, in order, before signing anything. One, how fast does your knowledge change? Weekly favours RAG, quarterly is workable either way, annual makes fine-tuning feasible. Two, do you need source attribution and audit trails? Regulated sectors, customer disputes and any client contract requiring explainability all favour RAG. If you have to show your working, the architecture has to show its working too.

Three, are you solving a knowledge problem, a format problem, or both? “The model does not know our products” is knowledge. “The model does not write reports the way our seniors do” is format. Conflating the two is the commonest procurement mistake here. The honest answer is usually “both”.

Four, how many queries will you run, and what is your latency budget? Above 10 million queries a month on a narrow task, fine-tuning becomes interesting. Below that, RAG wins on cost. Sub-200ms latency rules out retrieval round-trips for the hot path.

Five, what is your data quality and team expertise? Fine-tuning demands clean labelled data and ML engineering capacity. RAG demands clean source documents and a lower technical ceiling. Many SMEs have the second, few have the first without external help.

Six, what is the budget shape you can sustain? RAG is OPEX-shaped, predictable, scales with use. Fine-tuning is CAPEX-shaped, with recurring spikes whenever the base model upgrades. Match the shape to how your business approves spend.

The honest answer for UK SMEs in 2026 is start with RAG, run it for three months, then evaluate whether a thin LoRA fine-tune for format earns its place on top. The hybrid pattern, named RAFT in the research, is where serious deployments tend to land. Reframe any pitch that opens with fine-tuning as the obvious answer. The interesting question is the one underneath it.

Sources

- Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. The foundational RAG paper. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html
- Hu et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. The dominant parameter-efficient fine-tuning method cited as the cheap path to format fine-tuning. https://arxiv.org/abs/2106.09685
- Zhang et al. (2024). RAFT: Adapting Language Model to Domain Specific RAG. The named hybrid pattern combining fine-tuning and retrieval. https://arxiv.org/abs/2401.08406
- Anthropic (2024). Introducing Contextual Retrieval. The 2024 RAG technique improvement we recommend for serious deployments. https://www.anthropic.com/news/contextual-retrieval
- AWS (2026). Bedrock Knowledge Bases documentation. The managed RAG reference for SMEs without a platform engineer. https://docs.aws.amazon.com/bedrock/latest/userguide/knowledge-base.html
- Microsoft (2026). Azure OpenAI On Your Data. The alternative managed RAG reference inside the Microsoft estate. https://learn.microsoft.com/en-us/azure/foundry-classic/openai/concepts/use-your-data
- Information Commissioner's Office (2024). Explaining decisions made with AI. The UK regulatory boundary on audit trails and source attribution. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/explaining-decisions-made-with-artificial-intelligence/
- Stanford RegLab (2024). Hallucinating Law: Legal Mistakes with Large Language Models in Legal Research Tools. Documents the named RAG failure mode of hallucinated citations. https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf
- EU AI Act (2024). Article 11, technical documentation requirements for high-risk AI systems. The deployer documentation rule that bites fine-tuned models harder than RAG. https://artificialintelligenceact.eu/article/11/
- AWS (2024). Best practices for fine-tuning Anthropic's Claude 3 Haiku on Amazon Bedrock. The named fine-tuning practitioner reference. https://aws.amazon.com/blogs/machine-learning/best-practices-and-lessons-for-fine-tuning-anthropics-claude-3-haiku-on-amazon-bedrock/

Frequently asked questions

We have been quoted £45,000 for a fine-tune on our document library. Is that the right move?

Probably not, if the library updates more often than annually. Fine-tuning bakes the documents into the model's weights, so every update means a retraining pass. RAG keeps the library hot in a vector database, costs roughly £400 a month to run, and lets engineers cite the exact paragraph in every answer. Ask the vendor to scope a RAG build first, with a thin LoRA fine-tune for format if format consistency is part of the brief.

Does fine-tuning give better answers than RAG?

Not on knowledge. A fine-tuned model is only as current as its training data, and it cannot show its working. RAG often produces better answers because the model is reading your actual documents at query time and can cite them. Fine-tuning produces better answers on format, tone and structured output, where you need the model to behave consistently across thousands of queries.

Can we use both?

Yes, and in 2026 that is the default for serious deployments. The named pattern is RAFT, retrieval-augmented fine-tuning, where the model is fine-tuned to reason well over retrieved context. In practice, many firms ship RAG first, then add a small LoRA fine-tune for format once the knowledge layer is stable. The order matters. Fine-tuning first locks in mistakes that RAG would have surfaced.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
