How distillation works in generative AI

A supplier sends you a proposal for an AI-powered customer query tool. The pitch mentions it runs on a “distilled model” to keep costs manageable. You nod, the phrase sounds sensible, and the conversation moves on. That pattern plays out in hundreds of vendor meetings across UK service businesses every week. The term matters more than it sounds, and understanding what it means changes the questions worth asking before you sign anything.

What is distillation in generative AI?

Distillation in generative AI creates a smaller, more efficient model by having a larger one teach it. The larger model, called the teacher, generates answers and sometimes step-by-step reasoning traces. The smaller model, called the student, learns to replicate those outputs. The result is a model trained for a specific task rather than general use, which costs less to run and responds faster.

The teacher might be a large frontier model or a capable proprietary model the vendor controls. It labels a set of inputs, sometimes producing detailed reasoning, and the student trains on those labels rather than on raw human-annotated data. That means the student needs far less human-labelled data to get started.
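The teacher-labels, student-learns loop can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not how any vendor builds a product: `teacher_label` is a hypothetical keyword rule standing in for a large model, and `StudentClassifier` is a deliberately tiny word-count model standing in for a small neural network. The shape of the process is the point: no human labels the data, the teacher does.

```python
from collections import Counter, defaultdict


def teacher_label(text: str) -> str:
    """Hypothetical stand-in for a large teacher model.

    A real pipeline would call a frontier model here; a keyword rule
    keeps the sketch runnable without any external service.
    """
    lowered = text.lower()
    if any(word in lowered for word in ("invoice", "refund", "payment")):
        return "billing"
    return "support"


class StudentClassifier:
    """A deliberately tiny student: it only stores per-label word counts."""

    def __init__(self) -> None:
        self.word_counts: dict[str, Counter] = defaultdict(Counter)

    def train(self, texts: list[str], labels: list[str]) -> None:
        # Learn from the teacher's labels, not from human annotation.
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())

    def predict(self, text: str) -> str:
        # Score each label by how often it has seen the words in this text.
        words = text.lower().split()
        scores = {
            label: sum(counts[word] for word in words)
            for label, counts in self.word_counts.items()
        }
        return max(scores, key=scores.get)


# Unlabelled examples of the narrow task (made up for illustration).
unlabelled = [
    "my invoice is wrong",
    "please refund my payment",
    "the app will not load",
    "cannot reset my password",
]

# Step 1: the teacher labels the inputs.
teacher_labels = [teacher_label(text) for text in unlabelled]

# Step 2: the student trains on those labels.
student = StudentClassifier()
student.train(unlabelled, teacher_labels)

# Step 3: the cheap student now handles new queries on its own.
print(student.predict("refund for invoice"))
print(student.predict("password reset not working"))
```

A query made up entirely of words the student never saw during training gives it nothing to score, so it falls back to an arbitrary label. That is a small-scale picture of why a distilled model struggles outside the range it was trained on.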

Google Research published “Distilling step-by-step” in 2023, showing that smaller student models can outperform models many times their size on specific tasks, using as little as one eighth the training data needed by conventional methods. The qualifier matters. Specific tasks only. Take the student outside the range it was trained on and performance drops.

Knowledge distillation as a machine learning technique has been in use since at least 2015. IBM describes it as an established method for compressing AI capability into a form that is practical to deploy at scale. What changed with large language models is the scale of capability being transferred.

Why does this matter for your business?

The business case for distillation comes down to cost and speed. UK government AI Insights found distilled models typically use 80 to 95 per cent fewer computing resources than frontier models, with some deployments cutting energy use by an order of magnitude. For a firm paying per API call or buying an AI subscription, that efficiency gap determines whether a narrow-task AI tool makes commercial sense at your scale.
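To see what an 80 to 95 per cent reduction could mean at your scale, a rough back-of-envelope calculation helps. The sketch below is illustrative only: the query volume and per-query price are made-up figures, and it assumes, for simplicity, that the price you pay falls in line with the computing resources used, which a vendor's pricing may not reflect.

```python
def monthly_cost(queries_per_month: int, cost_per_query: float) -> float:
    """Total monthly spend for a given query volume and per-query price."""
    return queries_per_month * cost_per_query


def distilled_cost_range(
    frontier_cost: float,
    low_saving: float = 0.80,   # lower end of the range quoted above
    high_saving: float = 0.95,  # upper end of the range quoted above
) -> tuple[float, float]:
    """Return (best case, worst case) monthly cost after the saving."""
    return frontier_cost * (1 - high_saving), frontier_cost * (1 - low_saving)


# Hypothetical figures: 50,000 queries a month at 2p each on a frontier model.
frontier = monthly_cost(queries_per_month=50_000, cost_per_query=0.02)
best_case, worst_case = distilled_cost_range(frontier)

print(f"Frontier model:  £{frontier:,.2f} per month")
print(f"Distilled model: £{best_case:,.2f} to £{worst_case:,.2f} per month")
```

Swap in your own volumes and the prices a vendor quotes you. The gap between the two lines is what the vendor's margin and your subscription fee have to fit inside.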

That cost structure is why distillation appears behind many products aimed at smaller businesses. A vendor building an AI-powered tool for a ten-person professional services firm cannot pass on the full running cost of a frontier model on every query. Compressing the capability into a smaller, cheaper model is what makes the product viable at a price point the buyer will accept.

Response speed is the second factor. A distilled model has fewer parameters, and often fewer layers, to compute through, so replies come back faster. For anything customer-facing where a short delay affects completion rates, that difference is worth weighing when you compare products.

Where will you actually meet it?

For owner-managed businesses, distillation usually shows up not as a technical choice they make but as something a vendor has already done. When a software product describes itself as powered by a “custom AI”, an “in-house model”, or an AI “trained on sector data”, a distilled model is often behind that claim. Knowing this changes the questions worth asking before you commit to a platform.

The most common contexts for smaller businesses are customer-support tools, intake form processors, document classifiers, and internal FAQ assistants. Snorkel AI, whose researchers collaborated with Google on the step-by-step distillation work, describes the pattern as transferring a general LLM’s capability into a smaller model tuned for a defined job, such as categorising support tickets, flagging contract clauses, or routing enquiries to the right team.

What this means in practice is that the product’s AI capability is frozen at the point of training. If your business context changes, the model may not keep pace without retraining. A vendor who cannot explain what teacher model was used, what data shaped it, and when it was last evaluated is one worth questioning before you commit.

When should you ask about distillation?

Whether distillation is relevant to your firm comes down to one question. Do you have a stable, repetitive task where speed or cost is a genuine constraint? Email triage, document classification, intake summaries, and FAQ responses are strong candidates. If the work varies, changes frequently, or requires judgement on inputs you cannot anticipate in advance, a distilled model is unlikely to hold up reliably.

The UK government’s model distillation guidance frames it as a technique suited to narrow tasks, not broad general intelligence. It also identifies the central risk, which is that the student model inherits whatever weaknesses the teacher had. If the teacher hallucinated on edge cases or over-generalised its outputs, those tendencies carry through to the student. Testing on your own sample of real cases matters more than vendor benchmark claims.
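Testing on your own cases does not need specialist tooling. The sketch below shows one minimal way to do it: a list of real queries paired with the answer you would expect, run through the model, with every mismatch recorded. `vendor_model` is a hypothetical stand-in for however the product is actually called, and the four cases are invented examples.

```python
from typing import Callable


def evaluate(
    predict: Callable[[str], str],
    cases: list[tuple[str, str]],
) -> tuple[float, list[tuple[str, str, str]]]:
    """Run each (text, expected) case and return accuracy plus the failures."""
    if not cases:
        raise ValueError("Provide at least one test case.")
    failures = []
    for text, expected in cases:
        actual = predict(text)
        if actual != expected:
            failures.append((text, expected, actual))
    accuracy = (len(cases) - len(failures)) / len(cases)
    return accuracy, failures


def vendor_model(text: str) -> str:
    """Hypothetical stand-in for the vendor's model; replace with a real call."""
    return "billing" if "invoice" in text.lower() else "support"


# Your own sample: real queries and the category a person would choose.
my_cases = [
    ("Where is my invoice?", "billing"),
    ("I was charged twice", "billing"),
    ("The login page is broken", "support"),
    ("Can I change my delivery address?", "support"),
]

accuracy, failures = evaluate(vendor_model, my_cases)
print(f"Accuracy on my cases: {accuracy:.0%}")
for text, expected, actual in failures:
    print(f"  Missed: {text!r} (expected {expected}, got {actual})")
```

The list of misses is usually more useful than the headline percentage, because it shows which kinds of query the model gets wrong.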

Distillation is the wrong tool if your process changes frequently, because you will spend more time retraining than you gain from the compression. If you need the model to retrieve live information or handle open-ended queries, a general model with retrieval-augmented generation is often better suited. For regulated firms, deploying a distilled model does not change FCA outsourcing expectations or ICO data protection obligations. A compact model is not a compliance shortcut.

When a vendor proposes a distilled model, ask what the teacher model was, what data was used to distil it, what failure modes were tested, and whether outputs have been evaluated on tasks that resemble your actual business cases. If those questions land awkwardly, treat that as informative.

What related AI techniques will you come across?

Distillation sits alongside several other AI approaches you will hear about in vendor conversations. Fine-tuning adjusts an existing model’s weights using a curated dataset, without compressing the model. Retrieval-augmented generation, commonly called RAG, gives a model access to a knowledge base at query time, without any retraining. Prompt engineering shapes how a general model responds without touching the model at all. Each addresses the same underlying challenge, getting useful performance on your specific task, by a different route.

The practical difference between distillation and fine-tuning trips up many early AI conversations. Fine-tuning produces a model of the original size, adjusted for a particular style or domain. Distillation produces a smaller one. Both can give you a model that performs well on a defined task, but only the distilled model is cheaper to deploy at volume.

RAG is worth separating out because for many small business use cases, particularly internal knowledge assistants or customer FAQ tools, a RAG setup built on a general model will outperform a distilled model and is far easier to keep current as your business changes.
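The reason a RAG setup is easier to keep current shows up even in a toy version. In the sketch below, which is an illustration and not a production design, retrieval is a crude word-overlap match and the final call to a general model is left out; `retrieve`, `build_prompt`, and the knowledge base entries are all hypothetical.

```python
def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the documents sharing the most words with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble the text that would be sent to a general model."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# A made-up knowledge base. Real systems use document stores and
# embedding-based search, not word overlap.
knowledge_base = [
    "Our office is open Monday to Friday, 9am to 5pm.",
    "Refunds are processed within 10 working days.",
]

print(build_prompt("when is the office open", knowledge_base))

# Keeping the system current is an edit to the data, not a retraining job.
knowledge_base.append("Saturday appointments are available from March.")
print(retrieve("are saturday appointments available", knowledge_base))
```

Adding the Saturday entry changes what the system can answer immediately. A distilled model would need retraining to absorb the same change.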

Prompt engineering is often the right starting point before either distillation or fine-tuning. Well-shaped prompts can extract strong performance from a general model at no training cost. If a vendor is recommending distillation before you have run a structured prompt experiment with your own queries, ask why.

For small firms, the practical conclusion is that you are unlikely to run a distillation process yourself. The engineering effort is significant and only worthwhile for stable, high-volume tasks with well-understood output requirements. What you can do is understand the term well enough to ask whether a vendor’s distilled product suits your specific situation, and to ask for evidence rather than accepting the pitch on trust.

