How distillation is used in generative AI

TL;DR

Distillation in generative AI trains a smaller, cheaper model by having a larger one teach it. The result can run at 80 to 95 per cent lower compute cost than a frontier model, making it practical for narrow tasks like email triage or document classification. The student model inherits the teacher's limitations, including hallucinations. For a small firm, the decision is rarely whether to distil, but whether a vendor's distilled product genuinely fits your specific use case.

Key takeaways

- Distillation creates a smaller model by training it on the outputs of a larger one, making it cheaper and faster to run on a specific task.
- The UK government has found distilled models consume 80 to 95 per cent fewer computing resources than frontier models, with some deployments cutting energy use by an order of magnitude.
- A distilled model inherits its teacher's weaknesses, including hallucinations and generalisation errors, so testing on real business cases matters more than vendor benchmark claims.
- Distillation suits stable, high-volume tasks such as email triage and document classification, but performs poorly when workflows change frequently or require open-ended reasoning.
- When a vendor proposes a distilled model, ask what the teacher model was, what data was used, and how outputs were evaluated on tasks that match your actual business cases.

A supplier sends you a proposal for an AI-powered customer query tool. The pitch mentions it runs on a “distilled model” to keep costs manageable. You nod, the phrase sounds sensible, and the conversation moves on. That pattern plays out in hundreds of vendor meetings across UK service businesses every week. The term matters more than it sounds, and understanding what it means changes the questions worth asking before you sign anything.

What is distillation in generative AI?

Distillation in generative AI creates a smaller, more efficient model by having a larger one teach it. The larger model, called the teacher, generates answers and sometimes step-by-step reasoning traces. The smaller model, called the student, learns to replicate those outputs. The result is a model trained for a specific task rather than general use, which costs less to run and responds faster.

The teacher might be a large frontier model or a capable proprietary model the vendor controls. It labels a set of inputs, sometimes producing detailed reasoning, and the student trains on those labels rather than on raw human-annotated data. That means the student needs far less human-labelled data to get started.
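As a rough illustration of that teacher-student loop, here is a minimal Python sketch. The `teacher_label` rules are a hypothetical stand-in for a large model, and the word-count student is far simpler than anything a vendor would ship. Real distillation trains a neural network, often on the teacher's probabilities or reasoning traces as well as its final answers.

```python
from collections import Counter, defaultdict


def teacher_label(text: str) -> str:
    """Stand-in for a large teacher model (hypothetical rules, not a real LLM)."""
    lowered = text.lower()
    if "invoice" in lowered or "payment" in lowered:
        return "billing"
    if "password" in lowered or "login" in lowered:
        return "support"
    return "general"


def train_student(unlabelled_texts):
    """Train a tiny word-count student purely on the teacher's labels."""
    word_counts = defaultdict(Counter)
    for text in unlabelled_texts:
        label = teacher_label(text)  # the teacher supplies the label, not a human
        for word in text.lower().split():
            word_counts[word][label] += 1
    return word_counts


def student_predict(word_counts, text: str) -> str:
    """Vote using the labels each known word was seen with during training."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(word_counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else "general"


# Illustrative, made-up emails; no human labels are attached to them.
emails = [
    "Please resend the invoice for March",
    "I forgot my password again",
    "Payment for the invoice is overdue",
    "Cannot login to the portal",
    "What are your opening hours",
]
student = train_student(emails)
print(student_predict(student, "invoice overdue"))
```

The student here only knows words it saw during training, which is a small-scale version of the same limitation: it works on inputs that resemble its training set and falls back to a default elsewhere.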

Google Research published “Distilling step-by-step” in 2023, showing that smaller student models can outperform models many times their size on specific tasks, using as little as one eighth the training data needed by conventional methods. The qualifier matters: specific tasks. Take the student outside the range it was trained on and performance drops.

Knowledge distillation as a machine learning technique has been in use since at least 2015. IBM describes it as an established method for compressing AI capability into a form that is practical to deploy at scale. What changed with large language models is the scale of capability being transferred.

Why does this matter for your business?

The business case for distillation comes down to cost and speed. UK government AI Insights found distilled models typically use 80 to 95 per cent fewer computing resources than frontier models, with some deployments cutting energy use by an order of magnitude. For a firm paying per API call or buying an AI subscription, that efficiency gap determines whether a narrow-task AI tool makes commercial sense at your scale.
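To make the arithmetic concrete, here is a small Python sketch. The per-query price and monthly volume are hypothetical, and the sketch assumes the price you pay scales directly with compute, which a vendor's actual pricing may not do.

```python
def distilled_monthly_cost(frontier_cost_per_query: float,
                           queries_per_month: int,
                           reduction: float) -> float:
    """Monthly cost if compute falls by `reduction` (0.80 means 80 per cent lower).

    Assumption: cost is proportional to compute used.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be at least 0 and below 1")
    frontier_monthly = frontier_cost_per_query * queries_per_month
    return frontier_monthly * (1.0 - reduction)


# Hypothetical inputs: 2p per query on a frontier model, 50,000 queries a month.
COST_PER_QUERY = 0.02
QUERIES = 50_000

frontier_monthly = COST_PER_QUERY * QUERIES
best_case = distilled_monthly_cost(COST_PER_QUERY, QUERIES, 0.95)
worst_case = distilled_monthly_cost(COST_PER_QUERY, QUERIES, 0.80)

print(f"Frontier model:  £{frontier_monthly:,.2f} a month")
print(f"Distilled model: £{best_case:,.2f} to £{worst_case:,.2f} a month")
```

Swapping in your own query volume and a vendor's quoted prices shows quickly whether the gap is material at your scale.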

That cost structure is why distillation appears behind many products aimed at smaller businesses. A vendor building an AI-powered tool for a ten-person professional services firm cannot pass on the full running cost of a frontier model on every query. Compressing the capability into a smaller, cheaper model is what makes the product viable at a price point the buyer will accept.

Response speed is the second factor. A distilled model has fewer parameters, and often fewer layers, to compute through, so replies typically come back faster. For anything customer-facing where a short delay affects completion rates, that difference is worth weighing when you compare products.

Where will you actually meet it?

For owner-managed businesses, distillation usually shows up not as a technical choice they make but as something a vendor has already done. When a software product describes itself as powered by a “custom AI”, an “in-house model”, or an AI “trained on sector data”, a distilled model is often behind that claim. Knowing this changes the questions worth asking before you commit to a platform.

The most common contexts for smaller businesses are customer-support tools, intake form processors, document classifiers, and internal FAQ assistants. Snorkel AI, whose researchers collaborated with Google on the step-by-step distillation work, describes the pattern as transferring a general LLM’s capability into a smaller model tuned for a defined job: categorising support tickets, flagging contract clauses, or routing enquiries to the right team.

What this means in practice is that the product’s AI capability is frozen at the point of training. If your business context changes, the model may not keep pace without retraining. A vendor who cannot explain what teacher model was used, what data shaped it, and when it was last evaluated is one worth questioning before you commit.

When should you ask about distillation?

The question of whether distillation is relevant to your firm comes down to one thing: do you have a stable, repetitive task where speed or cost is a genuine constraint? Email triage, document classification, intake summaries, and FAQ responses are strong candidates. If the work varies, changes frequently, or requires judgment on inputs you cannot anticipate in advance, a distilled model is unlikely to hold up reliably.

The UK government’s model distillation guidance frames it as a technique suited to narrow tasks, not broad general intelligence. It also identifies the central risk: the student model inherits whatever weaknesses the teacher had. If the teacher hallucinated on edge cases or over-generalised its outputs, those tendencies carry through to the student. Testing on your own sample of real cases matters more than vendor benchmark claims.
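Testing on your own cases does not need much tooling. The Python sketch below assumes you have a small set of real messages with the answer you would expect for each. The `vendor_model` function is a hypothetical placeholder for whatever call the vendor's product exposes, and the sample messages are invented.

```python
from collections import defaultdict


def evaluate(predict, labelled_cases):
    """Return overall accuracy, per-category accuracy, and the failed cases."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    failures = []
    for text, expected in labelled_cases:
        predicted = predict(text)
        totals[expected] += 1
        if predicted == expected:
            correct[expected] += 1
        else:
            failures.append((text, expected, predicted))
    per_category = {label: correct[label] / totals[label] for label in totals}
    overall = sum(correct.values()) / len(labelled_cases) if labelled_cases else 0.0
    return overall, per_category, failures


def vendor_model(text: str) -> str:
    """Hypothetical placeholder; in practice this would call the vendor's tool."""
    return "complaint" if "refund" in text.lower() else "enquiry"


# Invented sample; replace with real cases from your own inbox.
sample = [
    ("I want a refund for last month", "complaint"),
    ("What time do you open on Saturdays?", "enquiry"),
    ("This is the third time my order has arrived damaged", "complaint"),
    ("Can you send me a quote?", "enquiry"),
]

overall, per_category, failures = evaluate(vendor_model, sample)
print(f"Overall accuracy: {overall:.0%}")
for label, accuracy in per_category.items():
    print(f"  {label}: {accuracy:.0%}")
for text, expected, predicted in failures:
    print(f"Missed: {text!r} (expected {expected}, got {predicted})")
```

The per-category breakdown and the list of misses are usually more informative than the headline figure, because they show which kinds of case the model gets wrong.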

Where distillation is the wrong tool: if your process changes frequently, you will spend more time retraining than you gain from the compression. If you need the model to retrieve live information or handle open-ended queries, a general model with retrieval-augmented generation is often better suited. For regulated firms, deploying a distilled model does not change FCA outsourcing expectations or ICO data protection obligations. A compact model is not a compliance shortcut.

The practical questions to put to any vendor proposing a distilled model: what was the teacher model, what data was used to distil it, what failure modes were tested, and have outputs been evaluated on tasks that resemble your actual business cases? If those questions land awkwardly, treat that as informative.

Distillation sits alongside several other AI approaches you will hear about in vendor conversations. Fine-tuning adjusts an existing model’s weights using a curated dataset, without compressing the model. Retrieval-augmented generation, commonly called RAG, gives a model access to a knowledge base at query time, without any retraining. Prompt engineering shapes how a general model responds without touching the model at all. Each addresses the same underlying challenge by a different route.

The practical difference between distillation and fine-tuning trips up many early AI conversations. Fine-tuning produces a model of the original size, adjusted for a particular style or domain. Distillation produces something smaller. Both can give you a model that performs well on a defined task, but distillation produces something cheaper to deploy at volume.

RAG is worth separating out because for many small business use cases, particularly internal knowledge assistants or customer FAQ tools, a RAG setup built on a general model will often outperform a distilled model and is far easier to keep current as your business changes.
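The "easier to keep current" point can be shown with a toy Python sketch. It assumes a knowledge base held as a plain list of sentences and uses simple word overlap for retrieval. Production RAG systems typically use embedding-based search, and the function names here are illustrative only.

```python
def retrieve(question: str, knowledge_base: list[str]) -> str:
    """Pick the entry sharing the most words with the question (toy retrieval)."""
    question_words = set(question.lower().split())
    return max(
        knowledge_base,
        key=lambda entry: len(question_words & set(entry.lower().split())),
    )


def build_prompt(question: str, knowledge_base: list[str]) -> str:
    """Put the retrieved entry in front of the question for a general model."""
    context = retrieve(question, knowledge_base)
    return f"Context: {context}\nQuestion: {question}"


# Invented knowledge base entries.
knowledge_base = [
    "Our office opens at 9am on weekdays",
    "Invoices are due within 30 days",
]

# Keeping the assistant current means editing this list, with no retraining.
knowledge_base.append("Parking is available behind the building")

print(build_prompt("when are invoices due", knowledge_base))
```

A distilled model would need retraining to learn the new entry, whereas here the change takes effect on the next query.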

Prompt engineering is often the right starting point before either distillation or fine-tuning. Well-shaped prompts can extract strong performance from a general model at no training cost. If a vendor is recommending distillation before you have run a structured prompt experiment with your own queries, ask why.

For small firms, the conclusion is practical: you are unlikely to run a distillation process yourself. The engineering effort is significant and only worthwhile for stable, high-volume tasks with well-understood output requirements. What you can do is understand the term well enough to ask whether a vendor’s distilled product suits your specific situation, and to ask for evidence rather than accepting the pitch on trust.

Sources

- UK Government AI Insights (2024). Model Distillation. Explains compute and energy efficiency gains of distilled models versus frontier models, including the 80 to 95 per cent compute reduction figure. https://www.gov.uk/government/publications/ai-insights/ai-insights-model-distillation-html
- Google Research (2023). Distilling Step-by-Step: Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. Named study showing student models outperforming larger models on specific tasks using as little as one eighth the training data of conventional methods. https://research.google/blog/distilling-step-by-step-outperforming-larger-language-models-with-less-training-data-and-smaller-model-sizes/
- Snorkel AI (2024). LLM Distillation Demystified: A Complete Guide. Explains the teacher-student training approach and its application to task-specific smaller models, including step-by-step distillation. https://snorkel.ai/blog/llm-distillation-demystified-a-complete-guide/
- IBM (2024). Knowledge Distillation. Mainstream enterprise explanation of teacher-to-student knowledge transfer and deployment efficiency as an established machine learning method. https://www.ibm.com/think/topics/knowledge-distillation
- ICO (2024). AI and Data Protection. UK regulator guidance confirming that AI systems, including custom or distilled models, remain subject to full UK GDPR obligations. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/
- FCA (2024). Outsourcing and Third-Party Risk Management. Relevant to regulated firms using AI vendor products, including those built on distilled models, where operational resilience obligations apply. https://www.fca.org.uk/firms/outsourcing-third-party-risk-management
- NCSC (2024). Artificial Intelligence. UK cybersecurity baseline for operational AI deployment, covering supply-chain risk and model integrity considerations. https://www.ncsc.gov.uk/collection/artificial-intelligence
- EU AI Act (2024). Regulation (EU) 2024/1689. Classification and transparency obligations that apply to AI systems, including distilled models, when deployed in EU-facing contexts. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689
- Quanta Magazine (2025). How Distillation Makes AI Models Smaller and Cheaper. Independent science journalism reporting on open-source distillation work validating that capable smaller models can be produced at materially lower training cost. https://www.quantamagazine.org/how-distillation-makes-ai-models-smaller-and-cheaper-20250718/

Frequently asked questions

What is the difference between distillation and fine-tuning in AI?

Distillation produces a smaller model by training it on the outputs of a larger one. Fine-tuning adjusts the weights of an existing model using a curated dataset but keeps the model the same size. Both can improve performance on a specific task, but distillation gives you something cheaper to run at scale, while fine-tuning keeps a larger, more flexible model and shapes its behaviour on your data.

How do I know if a vendor's product uses a distilled model?

Vendors rarely name distillation explicitly, but certain phrases signal it. Claims of a "custom AI", an "in-house model", or an AI "trained on sector data", combined with low running costs or fast response times, often indicate a distilled model underneath. Ask the vendor directly what base model was used, what data shaped it, and when it was last evaluated on tasks like yours. A vendor with a reliable product should answer clearly.

Does using a distilled model change UK GDPR obligations?

A smaller, cheaper model does not remove data protection responsibilities. If a distilled model processes personal data about customers, staff, or others, UK GDPR applies in full. You still need a lawful basis, data minimisation controls, and supplier due diligence on the vendor. The ICO's AI and data protection guidance covers this explicitly, and your responsibility as the data controller does not transfer to the vendor because the model is compact.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

