A practice manager at a 25-staff UK accountancy firm sent me a vendor quote last month. Eighteen thousand pounds to fine-tune a custom model on twelve months of historical email, plus nine hundred a month in hosting and retraining, all to triage inbound mail into billing, technical, scheduling, and general. She had heard somewhere that you could do this kind of thing “with a prompt”, and was trying to work out what that actually meant.
She ran a one-hour test. Thirty real emails pasted into Claude with the instruction “classify each email into billing, technical, scheduling, or general, output one word.” Twenty-seven out of thirty correct on the first attempt, no examples, no setup, total cost roughly four pence in tokens. She added three example emails for the borderline cases. Thirty out of thirty on her sample. She wanted to know what the eighteen thousand was actually paying for.
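For anyone who wants to reproduce that test, it is a few lines against the API. A minimal sketch, assuming the Anthropic Python SDK; the model name is a placeholder, and the fallback label is my addition rather than part of her experiment:

```python
# Zero-shot email triage: the manager's one-hour test in code.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LABELS = {"billing", "technical", "scheduling", "general"}

def classify(email_text: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder: use whichever model you are testing
        max_tokens=5,               # one-word answer, so cap the output hard
        messages=[{
            "role": "user",
            "content": "Classify this email into billing, technical, scheduling, "
                       "or general. Output one word.\n\n" + email_text,
        }],
    )
    label = response.content[0].text.strip().lower()
    return label if label in LABELS else "general"  # defensive fallback, my choice
```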
What is the difference between zero-shot and few-shot learning?
Zero-shot means asking the model to do the task with no examples in your prompt, relying on patterns it absorbed during pre-training. Few-shot means including two to ten correct input-output examples directly in the prompt itself. Neither retrains the model. Neither updates weights. Both happen at the moment you run the prompt, which is why researchers call them in-context learning rather than training.
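The distinction is easiest to see side by side. Two prompts for the same triage task, with the example emails invented for illustration; the only difference is whether worked examples travel inside the prompt:

```python
# Zero-shot: instruction only. Few-shot: the same instruction plus worked
# examples. Neither touches the model's weights. {email} marks where the
# real email text goes; these are plain strings, not f-strings.

ZERO_SHOT = """Classify this email as billing, technical, scheduling, or general.
Output one word.

Email: {email}"""

FEW_SHOT = """Classify each email as billing, technical, scheduling, or general.
Output one word.

Email: My March invoice still shows the old VAT rate.
Label: billing

Email: The portal logs me out every time I upload a receipt.
Label: technical

Email: Can we move Thursday's review call to the afternoon?
Label: scheduling

Email: {email}
Label:"""
```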
Fine-tuning is the third rung and is genuinely different. It retrains the model on a labelled dataset, updates its internal weights, and produces a new model checkpoint you then own and maintain. It needs hundreds to thousands of examples, a data scientist or specialist, and a budget that lands well above the prompt-only options. The “learning” in “few-shot learning” is a holdover from the research literature and misleads in plain English, because nothing is actually retrained.
When does zero-shot work straight out of the box?
Zero-shot works well for tasks the model has seen at scale during pre-training. Four use cases reliably clear the SME accuracy bar in 2026: support ticket classification at 70 to 85 percent, document summarisation across many document types, sentiment analysis at 80 to 88 percent, and named entity recognition for common entities like names, companies and dates. If your task fits one of these and clears your threshold, you are done.
The unifying thread is semantic clarity. The model does not need an example because the concept is already deeply represented in its weights. The practical implication is that if your use case is routine classification, summarisation, sentiment, or straightforward entity extraction, you should test zero-shot first on 50 to 100 real samples and measure the accuracy by hand. If you clear the threshold, deploy and stop. The post is over for that workload.
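The hand-check is a short script, not a project. A sketch, assuming the classify function from the earlier example and a sample of emails you have labelled yourself:

```python
# Measure zero-shot accuracy against your tolerance threshold.
def measure(samples: list[tuple[str, str]], threshold: float = 0.85) -> bool:
    correct = sum(1 for email, label in samples if classify(email) == label)
    accuracy = correct / len(samples)
    print(f"{correct}/{len(samples)} correct ({accuracy:.0%})")
    return accuracy >= threshold

# True: deploy zero-shot and stop.
# False: add three to five diverse examples to the prompt and rerun.
```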
Where zero-shot drops off is on domain-specific entities, internal classification labels, and tasks that need a particular firm voice or output format: medical conditions, regulatory codes, internal priority categories, structured JSON output, ranked recruitment criteria. The model has seen the general shape but does not know your specifics. That is the moment few-shot earns its place.
When does few-shot dramatically beat zero-shot?
The clearest published case comes from airline-tweet sentiment classification. Zero-shot reached 19 percent accuracy on the test set. Few-shot with four carefully chosen examples reached 97 percent accuracy on the same test set, with no retraining, no weight updates, no new data. Just four good examples in the prompt. That is the difference between an unusable system and a production-ready one.
Few-shot wins specifically when the task involves domain-specific terminology, formatting precision, or stylistic consistency. A recruiter scoring candidates with internal ranking criteria typically sees zero-shot land at 60 to 65 percent alignment with their judgement; three to five worked examples lift that to 75 to 85 percent. Structured JSON output goes from 40 to 60 percent parseable in zero-shot mode to 85 to 95 percent in few-shot mode with one to three correctly formatted examples.
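In practice that means one correctly formatted example embedded in the prompt, and a json.loads check over a sample of outputs to measure parseability. A sketch; the field names and the example email are invented for illustration:

```python
import json

def json_prompt(email_text: str) -> str:
    # few-shot for format: one correctly formatted example, then the real input
    return (
        "Extract intent, date, and reference from the email. Reply with JSON only.\n\n"
        "Email: Please book the onboarding call for 14 May, ref INV-2291.\n"
        'Output: {"intent": "scheduling", "date": "2026-05-14", "reference": "INV-2291"}\n\n'
        "Email: " + email_text + "\nOutput:"
    )

def is_parseable(model_output: str) -> bool:
    # the 85 to 95 percent figure is exactly this check, run over a sample
    try:
        json.loads(model_output)
        return True
    except json.JSONDecodeError:
        return False
```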
Quality and diversity beat quantity. Research on financial sentiment and code translation shows performance often peaks between 3 and 25 examples and degrades sharply beyond that. One published code-translation result peaked at 25 examples and declined steadily as the count rose to 625, finishing below its 25-example accuracy. Three or four well-chosen, diverse examples will usually beat twenty hastily picked ones. Design the prompt deliberately rather than stuffing it.
When is fine-tuning actually justified?
Fine-tuning earns its place at sustained scale on a stable narrow task, when a properly engineered few-shot prompt still falls short, or when data residency forces the model inside your own boundary. Vendor pricing in 2026 puts a 1,000-example OpenAI fine-tune at roughly £15 to £75, completing in 30 to 90 minutes. The breakeven against few-shot with prompt caching sits around 50,000 to 100,000 queries a month.
Most UK SMEs in the £1m to £10m turnover band do not reach that ceiling. A typical service business runs 1,000 to 5,000 classification or extraction tasks a month per workload. Anthropic’s prompt caching gives a 90 percent discount on cached input tokens after a 25 percent write premium on the first call, which means a static few-shot prompt costs less than the headline numbers suggest. The economic case for fine-tuning rarely closes for service businesses operating at this scale.
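The arithmetic is worth sketching. The token counts and the per-token rate below are placeholders; the 1.25x write premium and 90 percent read discount mirror the caching terms just described:

```python
# Per-query input cost for a static few-shot prompt, cached versus not.
PROMPT_TOKENS = 1_500      # instructions plus three to five examples (static)
QUERY_TOKENS = 300         # the email itself (varies per call)
RATE = 3.00 / 1_000_000    # price per input token in pounds, placeholder

uncached = (PROMPT_TOKENS + QUERY_TOKENS) * RATE
warm = (PROMPT_TOKENS * 0.10 + QUERY_TOKENS) * RATE   # cached prompt read at 10%
first = (PROMPT_TOKENS * 1.25 + QUERY_TOKENS) * RATE  # cache write at 125%

print(f"no caching:  £{uncached:.6f} per query")
print(f"warm cache:  £{warm:.6f} per query")  # roughly 4x cheaper at these numbers

# Caches expire after a short window, so the warm-cache figure holds for
# batched or steady traffic, not one query an hour.
```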
Fine-tuning also locks you to a base model version. When the provider releases a new one, you either stay on a base headed for deprecation or pay to re-run the fine-tune. That maintenance tax matters. The honest 2026 pattern for many SMEs is that prompts and retrieval handle knowledge, few-shot handles behaviour, and fine-tuning is reserved for the narrow case where neither does the job at the volume you actually run.
How should you decide for your business?
Run a five-step sequence. One, identify the task category: classification, summarisation, extraction, sentiment, scoring. Two, build a clear zero-shot prompt and test 50 to 100 real samples by hand against your tolerance threshold, usually 80 to 85 percent. Three, if zero-shot misses, add three to five diverse examples and retest. Four, refine the prompt structure before reaching for fine-tuning. Five, consider fine-tuning only at sustained volume.
Watch the four misconceptions that mislead procurement. Zero-shot does not mean the model knows nothing; it has been pre-trained on billions of tokens. Few-shot examples do not need a special technical format; plain-text question-answer pairs work as well as XML or JSON. More examples are not always better; performance usually peaks between 3 and 25. Zero-shot and fine-tuning are not the only options; few-shot is the dominant middle path for SME work and is usually all you need.
The procurement question writes itself. If a vendor leads with fine-tuning, ask them to show you the zero-shot baseline first, then the few-shot result, then the volume-and-cost calculation that makes the fine-tune pay back. Most of the time the cheaper rung does the job. The eighteen-thousand-pound quote in the practice manager’s inbox failed that test in an hour, for four pence in tokens, with a prompt anyone in the firm could have written. That is the shape of the conversation worth having before you sign anything.



