A 30-staff specialist construction consultancy showed me a vendor quote last month. Forty-five thousand pounds, four months, to fine-tune a Llama 70B on the firm’s project library. The technical lead asked one question. “How often is the library updated?” Weekly, came the answer. The quote was for the wrong architecture.
The right architecture, for that firm and almost every SME like it, was RAG, retrieval-augmented generation, with a thin LoRA fine-tune for format on top. Six weeks instead of four months. Ten thousand pounds total. Engineers could cite the exact source paragraph in every report. The opening question was not “RAG or fine-tuning”. It was “is the thing you want to specialise the model on knowledge, format, or both?”
The choice you’re facing
RAG, introduced by Lewis and colleagues in 2020, keeps the foundation model fixed and serves it your knowledge at query time from a vector database. You chunk the firm’s documents, embed them, store them in Pinecone, Weaviate or AWS Bedrock Knowledge Bases, then retrieve the relevant chunks and pass them as context. The model itself is unchanged. What changes is what it is reading.
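The query path can be sketched in a few lines. The embedding and the store below are toy stand-ins, a bag-of-words counter and an in-memory list, rather than a real embedding model and Pinecone or Weaviate; the shape of the flow, chunk, embed, retrieve, assemble, is the point.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system calls an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank stored chunks by similarity to the query, keep the top k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # The model is unchanged; only the context it reads changes.
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Retaining wall inspections are required every six months on site B.",
    "Invoices are processed within 30 days of receipt.",
    "Concrete pours must be signed off by the structural engineer.",
]
prompt = build_prompt("How often are retaining wall inspections required?", chunks)
```

Swap the toy pieces for a hosted embedding model and a managed vector store and this is, structurally, the whole architecture.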
Fine-tuning adjusts the model’s weights, baking patterns into the model itself using a labelled training set. Modern parameter-efficient methods like LoRA (Hu et al. 2022) update only a small fraction of weights, so a fine-tune on a 7B model can ship for a few thousand pounds rather than tens of thousands. RAG specialises the model on knowledge at query time. Fine-tuning specialises the model on format, behaviour and reasoning patterns at training time.
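The “small fraction of weights” claim survives a back-of-envelope check. The figures below are illustrative assumptions, not any specific model’s configuration: a 7B-parameter model with 32 layers and hidden size 4096, rank-8 LoRA adapters on the four attention projection matrices.

```python
def lora_trainable_params(layers: int, hidden: int, rank: int,
                          matrices: int = 4) -> int:
    # Each adapted hidden x hidden matrix gains two low-rank factors:
    # one hidden x rank and one rank x hidden.
    return layers * matrices * 2 * hidden * rank

base_params = 7_000_000_000
lora_params = lora_trainable_params(layers=32, hidden=4096, rank=8)
fraction = lora_params / base_params
print(f"LoRA params: {lora_params:,} ({fraction:.3%} of base)")
```

Roughly eight million trainable parameters against seven billion frozen ones, around a tenth of one per cent, which is why the compute bill drops from tens of thousands of pounds to a few thousand.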
That distinction is the whole game. Many firms framing this as “RAG or fine-tuning” are conflating two problems, and many vendor pitches lead with fine-tuning because the engagement is bigger. Sit with the diagnostic before you sit with the quote.
When RAG is the right answer
RAG is the right answer when knowledge changes faster than your retraining cycle, when source attribution matters, when you serve multiple domains with different document sets, or when your team needs to iterate quickly without an ML engineer in the loop. For UK SMEs in 2026, this covers the majority of first specialisation projects. The cost shape is predictable, the toolchain is mature and the compliance picture is cleaner.
The dynamic-knowledge case is the strongest. A regulated firm whose document library updates weekly cannot sustain a weekly fine-tune. Each retraining pass costs thousands and risks degrading capability on adjacent tasks. With RAG, the team updates the document and the next query reflects it.
The audit-trail case is the second strongest. The ICO’s guidance on explaining AI decisions sets the boundary clearly: a UK firm using AI in a regulated context must be able to show the information and processes behind every output. RAG returns the specific chunks used to ground a response, so the citation is part of the architecture. A fine-tuned model encodes the source across billions of weights, so “where did this answer come from” has no clean answer. For FCA-supervised firms, healthcare providers and legal practices, that gap is decisive.
The 2026 RAG ecosystem has done the heavy lifting. AWS Bedrock Knowledge Bases, Azure OpenAI On Your Data and OpenAI’s Assistants File Search offer hosted RAG where you upload documents and start querying. LangChain, LlamaIndex and Haystack give more control if you want it. Cost shape sits around £50 to £500 a month for a vector database, plus embedding and inference at standard rates. A typical SME pilot is production-ready in six to eight weeks.
When fine-tuning is the right answer
Fine-tuning is the right answer when the problem is format or behaviour rather than knowledge, when query volume is very high and per-query economics dominate, when latency rules out a retrieval round-trip, or when the domain is narrow and stable enough to justify training. These are real cases. They are rarer than vendors imply, and they almost never apply on the first specialisation pass.
The format case is the cleanest. A model that must always respond in a specific schema, with a particular tone, applying a fixed taxonomy to thousands of inputs, benefits from fine-tuning in a way that prompt engineering rarely matches. A fine-tuned smaller model, trained on a few thousand annotated examples, classifies or extracts more reliably than a larger general model with retrieved context, at a fraction of the per-query cost.
Latency is the second case. A RAG call adds 50 to 300ms of retrieval latency on top of inference. For real-time interactive use where the user feels every extra millisecond, removing that round-trip decides whether the interface feels conversational or sluggish.
The third case is unit economics. Above roughly 10 million queries a month on a narrow task, a fine-tuned 7B running on a couple of GPUs can outperform a general 70B via API at a tenth of the per-query cost. The 2026 toolchain (OpenAI fine-tuning API, AWS Bedrock fine-tuning for Claude Haiku, Hugging Face Transformers, Modal and RunPod) makes a LoRA fine-tune shippable for £500 to £5,000 in compute. Below that volume threshold, the maths usually favours RAG.
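The break-even arithmetic is worth doing explicitly. Every price below is a placeholder assumption, not a quote from any provider: a per-query API rate for the large model against fixed GPU rental for the self-hosted fine-tune, whose marginal cost per query is near zero.

```python
def monthly_cost_api(queries: int, per_query: float) -> float:
    # API pricing scales linearly with volume.
    return queries * per_query

def monthly_cost_self_hosted(gpu_hours: float, gpu_rate: float,
                             train_amortised: float) -> float:
    # Fixed rental plus the fine-tune cost spread across its useful life.
    return gpu_hours * gpu_rate + train_amortised

# Assumed figures, for illustration only.
per_query_api = 0.002          # £ per query via a large-model API
gpu_rate = 2.0                 # £ per GPU-hour
gpu_hours = 2 * 24 * 30        # two GPUs, always on, for a month
train_amortised = 300.0        # £3,600 fine-tune spread over twelve months

for queries in (1_000_000, 10_000_000):
    api = monthly_cost_api(queries, per_query_api)
    hosted = monthly_cost_self_hosted(gpu_hours, gpu_rate, train_amortised)
    cheaper = "self-hosted" if hosted < api else "API"
    print(f"{queries:>11,} queries/month: API £{api:,.0f} "
          f"vs hosted £{hosted:,.0f} -> {cheaper}")
```

Under these assumptions the API wins at one million queries a month and loses badly at ten million, which is the volume threshold doing its work. Rerun it with your own rates before believing either side of the argument.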
What it costs to get wrong
Both approaches have well-documented failure modes, and they look opposite to each other. The RAG failure is hallucinated grounding: retrieval misses the right chunk and the model invents a plausible answer with a fabricated citation. The fine-tuning failure is baked-in error: training-data mistakes encoded in the weights and applied consistently. Each fails quietly enough that teams miss it until a regulator or a customer notices.
Stanford’s 2024 study of legal AI tools found commercial systems claiming “hallucination elimination” still hallucinated 17 to 33 per cent of the time, often with citations to non-existent cases. The mitigation is mechanical: precision-at-K and recall-at-K metrics, chunking tuned to the document type, hybrid keyword and semantic retrieval, reranking before the model sees the chunks, and programmatic verification of citations against retrieved content. Treat these as structural requirements, not polish items to add later.
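The retrieval metrics named above are small enough to write out. Precision-at-K asks what fraction of the retrieved chunks are relevant; recall-at-K asks what fraction of the relevant chunks were retrieved. Chunk IDs and labels here are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Of the top-k chunks the retriever returned, how many are relevant?
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Of the chunks a human labelled relevant, how many made the top k?
    return sum(1 for c in retrieved[:k] if c in relevant) / len(relevant)

retrieved = ["c7", "c2", "c9", "c4"]   # ranked retriever output
relevant = {"c2", "c4", "c5"}          # human-labelled ground truth

p4 = precision_at_k(retrieved, relevant, k=4)   # 2 of 4 retrieved are relevant
r4 = recall_at_k(retrieved, relevant, k=4)      # 2 of 3 relevant were found
```

Run both over a labelled evaluation set on every chunking or reranking change. A retriever that looks fine anecdotally can be missing a third of the relevant chunks, and that missing third is where the fabricated citations come from.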
Fine-tuning fails differently. Catastrophic forgetting is the second-order failure: a fine-tune for new tasks can erode earlier capability, with a 2024 study finding 15 to 23 per cent of attention heads severely disrupted by sequential fine-tuning. The compliance picture is sharper still. UK GDPR’s right to be forgotten is straightforward in RAG (delete the document and the request is honoured) but harder for fine-tuned weights, where honouring a deletion request implies retraining. EU AI Act Article 11 documentation requirements bite both, but harder on fine-tuned models because of training-data provenance. Cost drift is the quiet failure on both sides. Budget for the right shape, not the headline number.
What to ask before you decide
Six questions, in order, before signing anything. One, how fast does your knowledge change? Weekly favours RAG, quarterly is workable either way, annual makes fine-tuning feasible. Two, do you need source attribution and audit trails? Regulated sectors, customer disputes and any client contract requiring explainability all favour RAG. If you have to show your working, the architecture has to show its working too.
Three, are you solving a knowledge problem, a format problem, or both? “The model does not know our products” is knowledge. “The model does not write reports the way our seniors do” is format. Conflating the two is the commonest procurement mistake here. The honest answer is usually “both”.
Four, how many queries will you run, and what is your latency budget? Above 10 million queries a month on a narrow task, fine-tuning becomes interesting. Below that, RAG wins on cost. Sub-200ms latency rules out retrieval round-trips for the hot path.
Five, what is your data quality and team expertise? Fine-tuning demands clean labelled data and ML engineering capacity. RAG demands clean source documents and a lower technical ceiling. Many SMEs have the second, few have the first without external help.
Six, what is the budget shape you can sustain? RAG is OPEX-shaped, predictable, scales with use. Fine-tuning is CAPEX-shaped, with recurring spikes whenever the base model upgrades. Match the shape to how your business approves spend.
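The six questions fold into a rough decision sketch. The thresholds are this article’s own rules of thumb; the ordering and the tie-breaks are illustrative assumptions, not a formula, and a real decision deserves more nuance than six parameters.

```python
def recommend(knowledge_updates_per_year: int, needs_audit_trail: bool,
              problem: str, queries_per_month: int, latency_budget_ms: int,
              has_labelled_data: bool) -> str:
    """problem is one of 'knowledge', 'format', 'both'."""
    # Questions 1 and 2: fast-moving knowledge or audit trails settle it.
    if needs_audit_trail or knowledge_updates_per_year >= 12:
        base = "RAG"
    # Question 3: a pure format problem with clean labelled data (question 5).
    elif problem == "format" and has_labelled_data:
        base = "fine-tune"
    # Question 4: volume and latency thresholds from the article.
    elif queries_per_month >= 10_000_000 or latency_budget_ms < 200:
        base = "fine-tune"
    else:
        base = "RAG"
    # The hybrid landing zone: RAG first, a thin format fine-tune later.
    if base == "RAG" and problem in ("format", "both"):
        return base + " now, thin LoRA for format later"
    return base

# A regulated firm with a weekly-updated library and no labelled data.
print(recommend(52, True, "both", 50_000, 2000, False))
```

The consultancy from the opening anecdote lands exactly where the diagnostic says it should: RAG now, a thin LoRA for format later.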
The honest answer for UK SMEs in 2026 is start with RAG, run it for three months, then evaluate whether a thin LoRA fine-tune for format earns its place on top. The hybrid pattern, named RAFT in the research, is where serious deployments tend to land. Reframe any pitch that opens with fine-tuning as the obvious answer. The interesting question is the one underneath it.