Zero-shot vs few-shot learning: when AI works on tiny data

TL;DR

Zero-shot, few-shot, and fine-tuning are three points on a cost ladder, and SME owners are routinely sold the most expensive option when the cheapest would do. Test zero-shot first on 50 to 100 real samples. If accuracy is below tolerance, add three to five examples in the prompt (few-shot). Only consider fine-tuning above roughly 50,000 queries a month on a stable narrow task. For many UK service businesses, the answer locks in at step one or two.

Key takeaways

- Zero-shot means asking the model with no examples. Few-shot means including 2 to 10 examples in the prompt itself. Neither retrains the model.
- For routine classification, summarisation, sentiment and entity extraction, zero-shot reaches 70 to 88 percent accuracy with no setup.
- The airline-tweet case study shows few-shot with four examples lifting accuracy from 19 percent to 97 percent on the same test set, no retraining.
- More examples is not always better. Performance often peaks between 3 and 25 examples and degrades beyond that point.
- Fine-tuning rarely pays back below roughly 50,000 to 100,000 queries a month on a stable task. Few-shot is the dominant middle path for SME work.

A practice manager at a 25-staff UK accountancy firm sent me a vendor quote last month. Eighteen thousand pounds to fine-tune a custom model on twelve months of historical email, plus nine hundred a month in hosting and retraining, all to triage inbound mail into billing, technical, scheduling, and general. She had heard somewhere that you could do this kind of thing “with a prompt”, and was trying to work out what that actually meant.

She ran a one-hour test. Thirty real emails pasted into Claude with the instruction “classify each email into billing, technical, scheduling, or general, output one word.” Twenty-seven out of thirty correct on the first attempt, no examples, no setup, total cost roughly four pence in tokens. She added three example emails for the borderline cases. Thirty out of thirty on her sample. She wanted to know what the eighteen thousand was actually paying for.

What is the difference between zero-shot and few-shot learning?

Zero-shot means asking the model to do the task with no examples in your prompt, relying on patterns it absorbed during pre-training. Few-shot means including two to ten correct input-output examples directly in the prompt itself. Neither retrains the model. Neither updates weights. Both happen at the moment you run the prompt, which is why researchers call them in-context learning rather than training.
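The difference is easiest to see as two prompt strings. Here is a minimal sketch using the email-triage task from the opening anecdote; the labels, example emails, and helper names are illustrative, not a real API call — either string would simply be sent to the model as the prompt.

```python
# Sketch of the two prompt shapes for the email-triage task.
# Labels and examples are illustrative; nothing here calls an API.

LABELS = ["billing", "technical", "scheduling", "general"]

def zero_shot_prompt(email: str) -> str:
    """No examples: the model relies entirely on pre-training."""
    return (
        f"Classify this email into one of: {', '.join(LABELS)}. "
        f"Output one word.\n\nEmail: {email}"
    )

def few_shot_prompt(email: str, examples: list[tuple[str, str]]) -> str:
    """Same task, but 2 to 10 worked input-output pairs precede the query."""
    shots = "\n\n".join(
        f"Email: {text}\nLabel: {label}" for text, label in examples
    )
    return (
        f"Classify this email into one of: {', '.join(LABELS)}. "
        f"Output one word.\n\n{shots}\n\nEmail: {email}\nLabel:"
    )

examples = [
    ("Your invoice for March is attached.", "billing"),
    ("The portal login keeps timing out.", "technical"),
    ("Can we move Thursday's call to 3pm?", "scheduling"),
]
print(few_shot_prompt("When is my VAT return due?", examples))
```

Both functions produce plain text, which is the point: the only difference between the two rungs is whether worked examples sit in front of the query.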

Fine-tuning is the third rung and is genuinely different. It retrains the model on a labelled dataset, updates its internal weights, and produces a new model checkpoint you then own and maintain. It needs hundreds to thousands of examples, a data scientist or specialist, and a budget that lands well above the prompt-only options. The word “training” in “few-shot training” is a holdover from the research literature and misleads in plain English.

When does zero-shot work straight out of the box?

Zero-shot works well for tasks the model has seen at scale during pre-training. Four use cases reliably clear the SME accuracy bar in 2026: support ticket classification at 70 to 85 percent, document summarisation across many document types, sentiment analysis at 80 to 88 percent, and named entity recognition for common entities like names, companies and dates. If your task fits one of these and clears your threshold, you are done.

The unifying thread is semantic clarity. The model does not need an example because the concept is already deeply represented in its weights. The practical implication is that if your use case is routine classification, summarisation, sentiment, or straightforward entity extraction, you should test zero-shot first on 50 to 100 real samples and measure the accuracy by hand. If you clear the threshold, deploy and stop. The post is over for that workload.
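Measuring the result of that 50-to-100-sample test needs nothing more than a comparison loop. A minimal sketch, assuming you have pasted the model's outputs back alongside your hand labels; the 80 percent threshold is the tolerance figure used throughout this post.

```python
# Minimal accuracy check for the zero-shot test described above.
# `preds` would be the model's outputs; `truth` your hand labels.

def accuracy(predictions: list[str], labels: list[str]) -> float:
    assert len(predictions) == len(labels)
    correct = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, labels)
    )
    return correct / len(labels)

preds = ["billing", "technical", "general", "billing", "scheduling"]
truth = ["billing", "technical", "general", "general", "scheduling"]
score = accuracy(preds, truth)
print(f"{score:.0%}")  # 4 of 5 correct -> 80%
print("deploy" if score >= 0.80 else "add few-shot examples")
```

On a real run you would use your full 50 to 100 samples, not five, but the decision rule is the same one-liner at the end.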

Where zero-shot drops off is on domain-specific entities, internal classification labels, and tasks that need a particular firm voice or output format. Medical conditions, regulatory codes, internal priority categories, structured JSON output, ranked recruitment criteria. The model has seen the general shape but does not know your specifics. That is the moment few-shot earns its place.

When does few-shot dramatically beat zero-shot?

The clearest published case comes from airline-tweet sentiment classification. Zero-shot reached 19 percent accuracy on the test set. Few-shot with four carefully chosen examples reached 97 percent accuracy on the same test set, with no retraining, no weight updates, no new data. Just four good examples in the prompt. That is the difference between an unusable system and a production-ready one.

Few-shot wins specifically when the task involves domain-specific terminology, formatting precision, or stylistic consistency. A recruiter scoring candidates with internal ranking criteria typically sees zero-shot land at 60 to 65 percent alignment with their judgement; three to five worked examples lift that to 75 to 85 percent. Structured JSON output goes from 40 to 60 percent parseable in zero-shot mode to 85 to 95 percent in few-shot mode with one to three correctly formatted examples.
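For the structured-output case, "parseable" has a concrete meaning: the response survives `json.loads` and carries the fields you expect. A small sketch of that check, with illustrative field names:

```python
import json

# Checking the "parseable JSON" rate mentioned above: the useful
# metric is whether json.loads succeeds and the expected keys are
# present. REQUIRED_KEYS is an illustrative schema, not a standard.

REQUIRED_KEYS = {"name", "company", "date"}

def is_parseable(output: str) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS <= data.keys()

outputs = [
    '{"name": "A. Khan", "company": "Acme Ltd", "date": "2026-01-14"}',
    'Sure! Here is the JSON: {"name": "B. Cole"}',  # chatty preamble
]
print([is_parseable(o) for o in outputs])  # [True, False]
```

The second output is the classic zero-shot failure mode: correct-looking JSON wrapped in conversational text, which a downstream parser rejects. One to three correctly formatted examples in the prompt are what push the pass rate from the 40-to-60-percent band into the 85-to-95-percent band.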

Quality and diversity beat quantity. Research on financial sentiment and code translation shows performance often peaks between 3 and 25 examples and degrades sharply beyond that. One published code-translation result hit peak accuracy at 25 examples, declined steadily through 625, and finished worse than the 25-example run. Three or four well-chosen, diverse examples will usually beat twenty hastily picked ones. Design the prompt deliberately rather than stuffing it.

When is fine-tuning actually justified?

Fine-tuning earns its place at sustained scale on a stable narrow task, when a properly engineered few-shot prompt still falls short, or when data residency forces the model inside your own boundary. Vendor pricing in 2026 puts a 1,000-example OpenAI fine-tune at roughly £15 to £75, completing in 30 to 90 minutes. The breakeven against few-shot with prompt caching sits around 50,000 to 100,000 queries a month.

Most UK SMEs in the £1m to £10m turnover band do not reach that threshold. A typical service business runs 1,000 to 5,000 classification or extraction tasks a month per workload. Anthropic’s prompt caching gives a 90 percent discount on cached input tokens after a 25 percent write premium on the first call, which means a static few-shot prompt costs less than the headline numbers suggest. The economic case for fine-tuning rarely closes for service businesses operating at this scale.
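Those caching figures are easy to turn into a rough monthly-cost sketch. The token counts and base price below are illustrative assumptions, not vendor quotes; the 25 percent write premium and 90 percent read discount are the figures cited in this post.

```python
# Rough monthly-cost sketch for a static few-shot prompt with prompt
# caching: one cache write at a 25 percent premium, then a 90 percent
# discount on the cached prefix for every subsequent call. Token
# counts and the base input price are illustrative assumptions.

def monthly_prompt_cost(
    queries: int,
    cached_tokens: int,    # the static few-shot prefix
    fresh_tokens: int,     # the email or query itself
    price_per_mtok: float, # base input price, pounds per million tokens
) -> float:
    per_tok = price_per_mtok / 1_000_000
    write = cached_tokens * per_tok * 1.25                   # first call
    reads = (queries - 1) * cached_tokens * per_tok * 0.10   # cached calls
    fresh = queries * fresh_tokens * per_tok                 # never cached
    return write + reads + fresh

# 3,000 queries a month, 1,500-token few-shot prefix, 400-token emails
print(round(monthly_prompt_cost(3_000, 1_500, 400, 2.40), 2))
```

At a few thousand queries a month the result lands in single-digit pounds, which is why the £18,000 fine-tune in the opening anecdote struggles to justify itself on cost alone.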

Fine-tuning also locks you to a base model version. When the provider releases a new one, you either stay on a base model heading for deprecation or pay to re-run the fine-tune. That maintenance tax matters. The honest 2026 pattern for many SMEs is that prompts and retrieval handle knowledge, few-shot handles behaviour, and fine-tuning is reserved for the narrow case where neither does the job at the volume you actually run.

How should you decide for your business?

Run a five-step sequence:

1. Identify the task category: classification, summarisation, extraction, sentiment, scoring.
2. Build a clear zero-shot prompt and test 50 to 100 real samples by hand against your tolerance threshold, usually 80 to 85 percent.
3. If zero-shot misses, add three to five diverse examples and retest.
4. Refine the prompt structure before reaching for fine-tuning.
5. Consider fine-tuning only at sustained volume.

Watch the four misconceptions that mislead procurement. Zero-shot does not mean the model knows nothing; it has been pre-trained on billions of tokens. Few-shot examples do not need a special technical format; plain text question-answer pairs work as well as XML or JSON. More examples is not always better; peak performance is usually 3 to 25. Zero-shot and fine-tuning are not the only options; few-shot is the dominant middle path for SME work and is usually all you need.

The procurement question writes itself. If a vendor leads with fine-tuning, ask them to show you the zero-shot baseline first, then the few-shot result, then the volume-and-cost calculation that makes the fine-tune pay back. Most of the time the cheaper rung does the job. The eighteen-thousand-pound quote in the practice manager’s inbox failed that test in an hour, for four pence in tokens, with a prompt anyone in the firm could have written. That is the shape of the conversation worth having before you sign anything.

Sources

- IBM (2024). Zero-shot learning explained. https://www.ibm.com/think/topics/zero-shot-learning
- IBM (2024). Few-shot learning explained. https://www.ibm.com/think/topics/few-shot-learning
- Labelbox (2024). Zero-shot, few-shot and fine-tuning trade-offs, including the airline-tweet sentiment case from 19 to 97 percent. https://labelbox.com/guides/zero-shot-learning-few-shot-learning-fine-tuning/
- Brown et al. (2020). Language models are few-shot learners (GPT-3). https://arxiv.org/abs/2005.14165
- Prompting Guide (2024). Few-shot prompting techniques. https://www.promptingguide.ai/techniques/fewshot
- Loukas et al. (2023). Making LLMs worth every penny: financial sentiment few-shot benchmarking with diminishing returns past five shots. https://arxiv.org/html/2312.08725v1
- Many-shot in-context learning research (2025). Code-translation peak at 25 examples then decline through 625. https://arxiv.org/html/2510.16809v2
- LLM-Stats (2026). Fine-tuning vs prompt engineering cost and volume breakeven analysis. https://llm-stats.com/blog/research/fine-tuning-vs-prompt-engineering-2026
- AWS (2024). Prescriptive guidance: RAG vs fine-tuning architecture choice. https://docs.aws.amazon.com/prescriptive-guidance/latest/retrieval-augmented-generation-options/rag-vs-fine-tuning.html
- Mole Valley Chamber (2026). UK SME AI Adoption Report 2026. https://molevalleychamber.co.uk/uk-sme-ai-adoption-report-2026/

Frequently asked questions

How do I know if zero-shot is good enough for my use case?

Run 50 to 100 real samples through a clear zero-shot prompt and grade the output by hand against your tolerance threshold, typically 80 to 85 percent for routing or extraction. If accuracy clears the threshold, deploy and stop. If it falls short, build a few-shot prompt with three to five diverse examples and retest the same set. Many SME use cases lock in at one of those two stages.

Does few-shot retrain the model on my examples?

No. Few-shot examples sit inside the prompt at the moment of use and influence one response only. The model's internal weights never change, no training job runs, and the next prompt without examples will behave as if the previous one never happened. The word "training" in "few-shot training" is a holdover from the research literature and is misleading in plain English. Both zero-shot and few-shot happen at inference time.

When does fine-tuning actually justify its cost for an SME?

When monthly volume on a stable narrow task crosses roughly 50,000 to 100,000 queries, when few-shot has been engineered properly and still falls short, or when a smaller fine-tuned model needs to run on your own infrastructure for data-residency reasons. For a typical UK service business in the £1m to £10m turnover band processing a few thousand classification or extraction tasks a month, that threshold is rarely reached and few-shot with prompt caching is cheaper across a 12-month horizon.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30 minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
