A consulting firm I work with had been quoted £18,000 by a vendor to “fine-tune the model on your data”. The pitch was that the system would understand their methodology, pick up their tone, and learn the way they wrote client reports. The managing partner asked me whether it was worth it. I asked her how many reports a month they wrote, and how many examples she could find of reports she would be proud to teach a model with. The first answer was twelve. The second was honest: maybe forty.
That conversation is the actual question behind fine-tuning. The technique is real. The cost has dropped. The business case is narrower than the pitch suggests.
What is fine-tuning?
Fine-tuning is the process of retraining a pre-trained AI model on your own examples so that its internal parameters shift to favour your patterns. The base model arrives knowing English and knowing how to write. Fine-tuning teaches it to write more like your firm specifically, classify tickets the way your operations team would, or produce output in a format your downstream systems need. The model keeps its general capability and acquires a specialised layer on top.
It sits between two cheaper options. Prompt engineering tells a model what to do at the moment of the query, without changing the model. Retrieval-augmented generation, RAG, gives the model your documents at query time as additional context. Fine-tuning is the only one of the three that changes the model itself. That is what makes it powerful. It is also what makes it expensive, slow to update, and risky if the underlying data is poor.
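The difference between the three is easiest to see side by side. A minimal sketch in Python, with a toy keyword lookup standing in for a real retrieval system; every name and string here is illustrative, not any vendor's API:

```python
# Three ways to specialise a general model, side by side.

# 1. Prompt engineering: the instructions travel with every query.
def build_prompt(query: str) -> str:
    return (
        "You are a consultant at Acme LLP. Write in our house style.\n"
        f"Question: {query}"
    )

# 2. RAG: fetch relevant documents at query time and add them as context.
DOCS = [
    "Acme LLP reports open with an executive summary.",
    "Acme LLP quotes fees in GBP, exclusive of VAT.",
]

def retrieve(query: str) -> list[str]:
    # Toy keyword match; a real system would use embeddings.
    return [d for d in DOCS if any(w in d.lower() for w in query.lower().split())]

def build_rag_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

# 3. Fine-tuning: the examples are consumed once, at training time, and the
#    specialisation lives in the model weights rather than in the prompt.
TRAINING_EXAMPLES = [
    {"input": "Summarise the engagement.", "output": "Executive summary: ..."},
]
# (The actual training run happens on the provider's infrastructure; after
#  it, the per-query prompt can be as short as the query itself.)

print(build_prompt("How should we quote fees?"))
print(build_rag_prompt("How should we quote fees?"))
```

Note where the cost lands in each case: prompting pays per query, RAG pays per query plus retrieval, fine-tuning pays once up front.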
A vendor offering “we’ll customise the model on your data” is usually offering one of these three. Sometimes they will mix them. Knowing which they mean matters because the cost, the timeline, and the risks are different.
Why it matters for your business
Three things change when you fine-tune. The first is per-query cost. At scale it drops, because the patterns are baked into the model's weights and the prompts can be shorter: you stop paying to send the same instructions and context with every call. Whether that saving ever covers the upfront cost depends almost entirely on your volume.
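The volume argument can be put in concrete numbers. A back-of-envelope sketch, in which every figure is an illustrative assumption rather than real pricing:

```python
# Break-even sketch: one-off fine-tuning cost vs per-query savings.
# Every figure below is an illustrative assumption, not real pricing.

cost_per_query_prompted = 0.04  # long prompt + retrieved context, per call
cost_per_query_tuned = 0.01     # shorter prompt once patterns are baked in
saving_per_query = cost_per_query_prompted - cost_per_query_tuned

def breakeven_months(upfront_cost: float, queries_per_month: int) -> float:
    """Months until per-query savings repay the one-off training cost."""
    return upfront_cost / (saving_per_query * queries_per_month)

# Twelve reports a month against an £18,000 vendor quote: never pays back.
print(f"{breakeven_months(18_000, 12):,.0f} months")    # → 50,000 months
# 5,000 tickets a month against a cheap in-house run: pays back within a year.
print(f"{breakeven_months(1_500, 5_000):,.1f} months")  # → 10.0 months
```

The exact prices will be wrong for your situation; the shape of the calculation will not be.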
The second is consistency of output. Fine-tuned models produce more reliably formatted answers than prompt-and-RAG setups. If your downstream system needs every output as valid JSON, every category labelled the same way, or every email with the same five-paragraph structure, fine-tuning is the cleaner approach. A 2026 academic study found human evaluators consistently preferred fine-tuned output for structured tasks, even when the prompts were carefully written.
The third is risk. Fine-tuning bakes patterns into model weights, including patterns from any bad data you trained on. The model becomes less interpretable, harder to audit under EU AI Act and UK ICO expectations, and harder to update than a RAG pipeline pointing at a fresh document store. For regulated decisions, hiring, lending, anything affecting individuals, that opacity is a compliance cost you should price in.
Where you will meet it
You will meet fine-tuning in vendor pitches where the language is “we’ll customise the model on your data”, “personalise it to your business”, “trained on your historical examples”, or simply “model customisation”. Sometimes the vendor will name the technique. Sometimes they will quietly mean RAG and the term is used loosely. Either way, ask which mechanism they are actually using and why.
You will also meet it in your own platforms if you use them at any scale. OpenAI’s fine-tuning sits in the platform console as a feature, with current pricing around £0.06 per million training tokens for GPT-5.4. AWS Bedrock calls the equivalent feature “Model Customisation” and offers it across Llama, Mistral, and other models on the platform. Hugging Face hosts open-source fine-tuning workflows that a technical team can run for the cost of compute alone. The economics have moved. The cost is no longer the main blocker.
The blocker now is data. A vendor offering to fine-tune the model on your data is implicitly assuming you have hundreds of clean, labelled examples to give them. The typical service business does not. They have transcripts, emails, and partial records, none of which are training data until someone has gone through and curated them.
When to ask about it, when to ignore it
Ask about fine-tuning when you have a high-volume, repetitive task where output consistency matters more than flexibility. Ticket routing at 5,000+ tickets a month. Quote generation in a fixed format. Document classification against a stable taxonomy. In those cases, fine-tuning earns its keep, provided the data is clean and the task is stable enough that the trained model will not be obsolete in three months.
Ignore the offer when any of four conditions are true. If you have fewer than 100 clean, labelled examples, fine-tuning will overfit and produce worse output than prompting. If your task is changing as you learn what good looks like, the fine-tuned model is constantly stale. If your data contains personal information you cannot lawfully send to the vendor’s training infrastructure, the compliance overhead may be larger than the benefit. And if a well-written prompt plus RAG already gets you to acceptable quality, the additional cost of fine-tuning is not earning anything.
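The four conditions reduce to a short checklist. As a sketch, with the thresholds taken from the paragraph above and the function itself purely illustrative:

```python
def fine_tuning_worth_asking_about(
    clean_labelled_examples: int,
    task_definition_stable: bool,
    data_lawfully_shareable: bool,
    prompt_plus_rag_acceptable: bool,
) -> bool:
    """Illustrative checklist: any one failing condition is a stop sign."""
    if clean_labelled_examples < 100:
        return False  # too few examples: the model will overfit
    if not task_definition_stable:
        return False  # the tuned model would be stale on arrival
    if not data_lawfully_shareable:
        return False  # compliance overhead outweighs the benefit
    if prompt_plus_rag_acceptable:
        return False  # a cheaper approach already meets the bar
    return True

# The consulting firm from the opening: ~40 examples, evolving house style.
print(fine_tuning_worth_asking_about(40, False, True, False))  # → False
```

The point of the function is the order of the checks: data volume first, because it is the condition most pitches quietly skip.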
The pattern I see most often in 2026 is the fourth one. A vendor offers fine-tuning because it is a higher-margin sale, and the SME accepts because the pitch sounds more sophisticated than “we’ll write better prompts”. The metric that matters is output quality at acceptable cost, which often points back to prompting and RAG.
Related concepts
Base model, or foundation model, is the pre-trained starting point: GPT-5.4, Claude Opus, Llama 3.1, Mistral. Fine-tuning takes one of these as input and produces a customised version. The base model is the work the AI provider has already done. The fine-tuned version is what you are paying to add on top.
Training data is the labelled examples used to fine-tune. Each example is an input paired with the correct output. The quality of training data sets a hard ceiling on the quality of the fine-tuned model. Garbage in, garbage out is mechanical and unforgiving.
LoRA, low-rank adaptation, is a cheaper variant of fine-tuning that adjusts a small subset of model weights rather than retraining the whole model. It costs less, runs faster, and preserves the base model’s general capabilities better. If a vendor is quoting fine-tuning unusually cheaply, ask whether they are using LoRA. The answer is usually yes, and that is fine.
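The cost difference behind LoRA is visible in a simple parameter count. For a single weight matrix of size d by k, full fine-tuning updates all d times k values, while LoRA trains two small matrices of rank r instead. A toy calculation, with sizes chosen purely for illustration:

```python
# Parameter count: full fine-tuning vs LoRA, for one weight matrix W (d x k).
d, k = 4096, 4096   # illustrative layer dimensions
r = 8               # LoRA rank, typically far smaller than d or k

full_params = d * k           # full fine-tuning updates every weight in W
lora_params = d * r + r * k   # LoRA trains B (d x r) and A (r x k) instead;
                              # at inference the model uses W + B @ A
                              # (ignoring LoRA's scaling factor)

print(f"Full fine-tuning: {full_params:,} trainable parameters")
print(f"LoRA (rank {r}):  {lora_params:,} trainable parameters")
print(f"Reduction: {full_params // lora_params}x fewer")
```

A real model has many such matrices, but the ratio per matrix is why a LoRA run costs a fraction of a full retrain.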
Catastrophic forgetting is the name for what happens when fine-tuning narrows the model so aggressively that it loses general capability. The newly fine-tuned model is great at your task and worse at everything else. Parameter-efficient methods like LoRA reduce the risk; aggressive full fine-tuning increases it.
Prompt engineering and RAG are the two cheaper alternatives. The mature pattern in 2026 is to start with prompting, layer in RAG when knowledge currency matters, and fine-tune only when you have the volume, the data, and the stability to make the maths work.
The honest test of a fine-tuning pitch is the data audit. If a vendor cannot tell you how many clean labelled examples they need, what they will do if you do not have enough, and how the fine-tuned model will be retrained when your processes evolve, the pitch is selling a technique, not a result.