Prompt engineering vs fine-tuning: which one your business actually needs

TL;DR

For a typical UK SME, the right answer is prompt engineering first, often combined with retrieval over your own documents, and fine-tuning later or never. Three thresholds justify the migration: thousands of clean labelled examples, a behavioural consistency the model cannot reach from instruction alone, and a per-token cost that makes a smaller fine-tuned model materially cheaper at scale. Below those thresholds, prompts plus retrieval do the work for a fraction of the cost and almost none of the maintenance.

Key takeaways

- Prompt engineering is configuration. Fine-tuning is training. Different cost, different reversibility, different maintenance.
- A typical SME should start with prompts plus retrieval (RAG). Fine-tuning is the migration step.
- Three thresholds justify fine-tuning: thousands of clean labelled examples, consistency that prompts cannot reach, or token-cost reduction at scale.
- The 2026 hybrid pattern is fine-tune for behaviour, RAG for knowledge.
- LoRA and QLoRA have made parameter-efficient fine-tuning cheaper, but the data-prep work is still the bottleneck.

A consultant I work with described two vendor pitches for the same task. One offered to write a “carefully engineered system prompt” with a knowledge base of her firm’s templates for £400 a month. The other offered to “fine-tune a model on your past proposals” for an £8,000 build and £600 a month. Same problem, same target output, two completely different shapes of solution. She asked me which one she should choose.

It is a common question by 2026, and for the typical SME the right answer is usually the cheaper one. But not always. What follows is the plain-English version: when the cheaper option keeps doing the work, and when it is time to upgrade.

The choice you’re facing

Prompt engineering means writing better instructions for an existing AI model, often combined with retrieval over your documents (the RAG pattern). The model is unchanged. You change what you ask it and what context you give it. Cost is the per-token API charge plus the work of refining the prompts. Reversibility is total: change a sentence, change the behaviour.
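To make that concrete, here is a minimal sketch of the pattern in Python. It assumes the OpenAI Python SDK; the model name, system prompt and retrieved snippets are placeholders, and any hosted chat API follows the same shape.

```python
# Prompt engineering plus retrieval: the model is untouched, you only change
# the instructions and the context you hand it at query time.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You answer questions for our firm. Use only the context provided. "
    "If the answer is not in the context, say you do not know."
)

def answer(question: str, retrieved_snippets: list[str]) -> str:
    context = "\n\n".join(retrieved_snippets)  # output of your retrieval step
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any hosted chat model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

Change the system prompt and the behaviour changes with the next request, which is what total reversibility means in practice.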

Fine-tuning means training the model further on your own labelled examples. The model’s internal weights are adjusted to bake patterns into the model itself. OpenAI, Anthropic, Google, AWS Bedrock and the open-weight tools all offer it. Cost runs from a few hundred pounds for a small parameter-efficient fine-tune (LoRA or QLoRA) to several thousand for a larger one, plus the work of preparing clean labelled data. Reversibility is partial.
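What “clean labelled data” means in practice is a file of input-output pairs drawn from real past work. The sketch below writes the chat-style JSONL format OpenAI’s fine-tuning API expects (other providers use similar shapes); the examples themselves are invented for illustration.

```python
# Each line pairs a real input with the output you want the model to learn.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "Write proposal sections in our house style."},
            {"role": "user", "content": "Draft the 'our approach' section for a payroll migration bid."},
            {"role": "assistant", "content": "We start every migration with a two-week discovery..."},
        ]
    },
    # ...hundreds to thousands more pairs, drawn from past work and cleaned by hand
]

with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```

Assembling and cleaning that file is where most of the cost sits, not the training run itself.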

A middle layer sits between them: parameter-efficient fine-tuning. LoRA and QLoRA train small “adapter” matrices on top of a frozen base model, capturing task-specific patterns without rewriting the whole model. They are cheaper, faster and more reversible than full fine-tuning, and have made the technique reachable for SMEs. The bulk of 2026 fine-tuning is parameter-efficient.
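For a sense of what that looks like, here is a hedged sketch using the Hugging Face peft and transformers libraries; the base model name and adapter settings are placeholders, not a recommendation.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load a frozen open-weight base model (placeholder name).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Small adapter matrices are trained on top; the base weights stay untouched.
config = LoraConfig(
    r=16,                    # adapter rank: smaller is cheaper, less expressive
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which attention projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model's weights
```

Because the adapter is a small, separate artefact, it can be swapped out or discarded without touching the base model, which is where the extra reversibility comes from.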

Three thresholds decide which path you should be on: how much labelled data you have, how consistent the behaviour needs to be, and how big your token bill is.

When prompt engineering (with RAG) is the right answer

For the typical UK SME, prompt engineering plus retrieval is the right answer almost all of the time.

It is the right answer for information-retrieval tasks. A team that wants the AI to answer from its own documents (policies, case notes, past proposals) is solving a retrieval problem, not a behaviour problem. RAG gives the model live access to those documents at query time. Fine-tuning your way to “knows our pricing” is the wrong tool for the job. Update the document and the answer updates the same day.

It is the right answer for the exploratory phase. A use case still being shaped does not have stable requirements to justify the data-prep work fine-tuning needs. Iterating on a prompt is fast; iterating on a fine-tune is a build cycle.

It is the right answer when your data is messy. Fine-tuning amplifies the patterns in your training data, the good and the bad. A team without a clean, consistent body of labelled examples will end up with a fine-tune that has memorised the inconsistency.

It is the right answer for low to moderate volume. Below a few hundred million tokens a month, the per-token saving from a smaller fine-tuned model rarely pays back the build cost.

The 2026 craft is good prompts plus good retrieval. A few-shot prompt with three or four well-chosen examples plus a clean RAG pipeline outperforms many early-stage fine-tunes.
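A few-shot prompt keeps the “training examples” in the request itself rather than in the model’s weights. The sketch below builds one; the example pairs are invented and would be replaced with your own best outputs.

```python
# Three or four well-chosen input/output pairs shown to the model at query time.
FEW_SHOT = [
    {"role": "user", "content": "Summarise: client has cancelled the Q3 audit engagement."},
    {"role": "assistant", "content": "Engagement update: Q3 audit cancelled at the client's request; no fees invoiced."},
    {"role": "user", "content": "Summarise: VAT registration approved, effective 1 May."},
    {"role": "assistant", "content": "Compliance update: VAT registration approved; effective from 1 May."},
]

def build_messages(system_prompt: str, new_input: str) -> list[dict]:
    # System instructions, then the worked examples, then the new task.
    return (
        [{"role": "system", "content": system_prompt}]
        + FEW_SHOT
        + [{"role": "user", "content": f"Summarise: {new_input}"}]
    )
```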

When fine-tuning is the right answer

Three thresholds justify migration to fine-tuning, and the case is strongest when more than one is crossed at once.

The first is data volume. A working rule of thumb across vendor guidance is at least 500 to 1,000 cleanly labelled examples for a focused task, often more for nuanced behaviour. Below that, prompt engineering with in-prompt examples usually wins. Above several thousand, the model can learn behaviours no single prompt can express.

The second is behavioural consistency. Tasks that require the model to write in a specific house style across thousands of outputs, follow a non-obvious internal format, or apply judgement that resists rule-by-rule expression are where fine-tuning earns its keep. A bid-writing team that wants every proposal in their voice will hit a prompting plateau. Fine-tuning past that plateau is what the technique is for.

The third is token economics at scale. A fine-tuned smaller model can be substantially cheaper per token than a larger SaaS model doing the same job. The break-even depends on model size, volume and current API rates, but for stable narrow workloads above a hundred million tokens a month, the saving compounds.
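The break-even arithmetic is simple enough to do on the back of an envelope. The figures below are hypothetical placeholders, not current prices; substitute the rates you are actually quoted.

```python
# Hypothetical per-million-token rates and build cost; substitute your own numbers.
large_model_cost_per_m = 10.00   # £ per million tokens on the larger hosted model
small_tuned_cost_per_m = 2.00    # £ per million tokens on the fine-tuned smaller model
build_cost = 8_000               # £ one-off fine-tune build
monthly_tokens_m = 100           # million tokens a month on this workload

monthly_saving = (large_model_cost_per_m - small_tuned_cost_per_m) * monthly_tokens_m
months_to_break_even = build_cost / monthly_saving
print(f"Monthly saving: £{monthly_saving:,.0f}; break-even after {months_to_break_even:.1f} months")
```

If the payback period is longer than the time you expect the workload and the base model to stay stable, the fine-tune has not earned its place.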

A fourth case, less common but worth knowing, is data privacy. Fine-tuning an open-weight model on your own infrastructure means the labelled examples never leave your boundary. For sectors where this is non-negotiable, it can be a stronger reason than the cost case.

The mature 2026 pattern is rarely fine-tuning instead of retrieval. It is fine-tuning for behaviour and retrieval for knowledge. The fine-tune handles how the model writes and reasons; RAG handles what it draws on. The two work together behind a multi-model gateway.

What it costs to get wrong

The expensive failure modes go in both directions.

Fine-tuning when prompt engineering would have done is the common one. A vendor pitches the fine-tune as the sophisticated solution; the SME pays the build cost and a monthly retainer; six months in, the use case is still taking shape and the fine-tune is already drifting from the new base model. Total spend ends up two or three times what an equivalent SaaS prompts-plus-RAG setup would have cost, and the team has lost agility.

The opposite mistake is staying on prompts past the point where a fine-tune would pay back. A team running a hundred million tokens a month on a stable, narrow task, with a thousand clean labelled examples sitting in a folder, is paying premium per-token rates when a fine-tuned smaller model would do the same job for less.

Brittleness when the base model updates is a third trap. A fine-tune is locked to a specific base version. When the provider releases a new one (every twelve to eighteen months by 2026 norms), the fine-tune either stays on a deprecating base or has to be re-run, which costs again.

Data privacy and IP missteps are the last category. Fine-tuning a SaaS model means uploading your training examples to the provider’s infrastructure. The ICO’s AI guidance is clear that you remain the data controller and need a lawful basis for that processing. Customer contracts often forbid using their data to train models. Read the contracts before you upload.

What to ask before you decide

Five questions, in order.

One: do you actually have a thousand or more cleanly labelled examples of the task, or could you assemble them? If the data does not exist in usable form, fine-tuning is not the next step; better prompts and a RAG layer over your existing documents are.

Two: what is the gap between what good prompts do today and what you need? If the gap is “the model needs our latest pricing”, that is retrieval. If the gap is “the model writes nothing like us no matter what I tell it”, that is behaviour and fine-tuning is on the table.

Three: what is your monthly token volume on this workload? Below 50 million, fine-tuning rarely pays back. Above 100 million on a stable narrow task, it often does.

Four: who owns the model artefact and the training data after the build? If a vendor is doing the fine-tune and the result lives in their account, you have built a dependency that is hard to walk away from. Insist on portability and data-export terms.

Five: what is your plan when the underlying base model is deprecated? A fine-tune with no migration plan attached is a fixed cost waiting to become a sunk one.

The honest answer for the typical UK SME in 2026 is start with good prompts and a good RAG layer, run that for as long as it works, and let the data and the bill tell you when fine-tuning has earned its place. Many teams never need to. The ones that do should make the move deliberately, not because a vendor’s pitch deck implied it was the next obvious step.

Frequently asked questions

How many examples do I need before fine-tuning is worth it?

Vendor guidance varies, but a working rule of thumb is at least 500 to 1,000 cleanly labelled examples for a focused task, often more. Below that, prompt engineering with a handful of in-prompt examples (few-shot prompting) usually outperforms fine-tuning, because the data is too thin to teach a real pattern.

Will fine-tuning replace my need for retrieval?

No. Fine-tuning bakes patterns and behaviour into the model; retrieval gives the model live access to your documents. They solve different problems. The mature 2026 pattern is to use both: fine-tune the model so it sounds and behaves like your firm, then use RAG to give it your latest pricing, policies and case files.

What happens to my fine-tune when the underlying model is updated?

A fine-tune is locked to a specific base model version. When the provider releases a new version, you can either stay on the old base (often eventually deprecated) or re-run the fine-tune on the new one, which costs again. This drift is one reason many SMEs stay on prompts plus retrieval longer than vendors suggest.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
