What is AI model distillation?

A supplier offers you an AI-assisted tool for processing client documents or handling incoming queries. When you ask which model it uses, they mention a “lite” version of one of the main platforms. You take their word for it.

Three months later, the outputs feel inconsistent. Some summaries are sharp; others miss obvious context. When you go back to the vendor, they reference model tiers and accuracy trade-offs. That is when you realise you never asked the question that mattered. What does “lite” actually mean, and who made that trade-off on your behalf?

The answer, in almost every case, traces to a technique called knowledge distillation. Understanding it will not make you an AI engineer. But it will give you the vocabulary to ask better questions of vendors, choose model tiers deliberately, and know when the cheaper option is genuinely good enough.

What is AI distillation?

Knowledge distillation is when a large AI model, called the teacher, trains a smaller, cheaper model, called the student, to behave in a similar way. The teacher shares its full distribution of confidence across all possible responses, not just a correct answer. That richer signal allows the student to retain much of the teacher’s accuracy at a fraction of the running cost.

The process works in three broad steps. First, the teacher, which may have hundreds of billions of parameters and cost millions in compute to build, processes a large set of examples. Second, for each example it produces what researchers call “soft targets”: a probability distribution showing not just which answer it prefers, but how confident it is across all possibilities. A model asked whether an email is a complaint might assign 82 per cent confidence to yes, 13 per cent to uncertain, and 5 per cent to no. That nuance is the learning signal.
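For readers who want to see what a soft target looks like in practice, here is a minimal sketch in Python. The raw scores are invented for illustration and chosen so the result lands close to the 82, 13 and 5 per cent split in the email example. The temperature setting is the knob from the original distillation research that makes the distribution softer; the specific values here are assumptions, not anyone's production settings.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the largest score for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher scores for one email (illustrative values only)
labels = ["complaint", "uncertain", "not a complaint"]
teacher_logits = [2.8, 0.96, 0.0]

# A hard label keeps only the winning answer
hard_label = labels[teacher_logits.index(max(teacher_logits))]

# Soft targets keep the teacher's confidence across every answer
soft_targets = softmax(teacher_logits, temperature=1.0)

# Raising the temperature spreads confidence more evenly
softer_targets = softmax(teacher_logits, temperature=2.0)

for name, p, q in zip(labels, soft_targets, softer_targets):
    print(f"{name}: {p:.2f} at T=1, {q:.2f} at T=2")
```

The hard label says only “complaint”. The soft targets also tell the student that “uncertain” was a more plausible runner-up than “not a complaint”, which is the extra information the student learns from.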

The student model then trains on those soft targets alongside the original labelled examples. Because soft targets carry far more information than simple right-or-wrong labels, the student can achieve high accuracy while being significantly smaller.
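The way the two signals are combined can be sketched as a single training score. What follows is a simplified illustration in the spirit of Hinton, Vinyals, and Dean, not a reproduction of any lab's training code. The weighting `alpha`, the temperature of 2.0 and all the example scores are assumptions chosen for readability.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term with an ordinary labelled-answer term."""
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # KL divergence: how far the student's softened view is from the teacher's
    soft_term = sum(t * math.log(t / s)
                    for t, s in zip(teacher_soft, student_soft) if t > 0)
    # Cross-entropy against the original labelled answer, at temperature 1
    hard_term = -math.log(softmax(student_logits)[true_index])
    # The temperature-squared factor follows the 2015 paper's recommendation
    return alpha * temperature ** 2 * soft_term + (1 - alpha) * hard_term

# Hypothetical scores for one email whose labelled answer is index 0
teacher = [2.8, 0.96, 0.0]
close_student = [2.5, 1.0, 0.1]   # broadly agrees with the teacher
poor_student = [0.0, 0.5, 2.0]    # prefers the wrong answer

loss_close = distillation_loss(close_student, teacher, true_index=0)
loss_poor = distillation_loss(poor_student, teacher, true_index=0)
print(f"close student: {loss_close:.3f}, poor student: {loss_poor:.3f}")
```

Training adjusts the student to push this score down, so a student that mirrors the teacher's pattern of confidence is rewarded even on examples where a plain right-or-wrong label would say very little.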

Hinton, Vinyals, and Dean published the foundational version of this technique in 2015. Since then it has become one of the primary routes AI labs take to build cheap, deployable models from their most capable research systems. DistilBERT, one well-documented example, ran 60 per cent faster with 40 per cent fewer parameters while retaining about 97 per cent of the original BERT model's language-understanding performance.

Why does it matter for your business?

The main reason distillation matters for owner-managed businesses is cost. Moving from a premium AI model tier to a cheaper distilled version can cut inference costs by 70 to 90 per cent per million tokens. For everyday tasks such as drafting, summarising, and classifying, the quality difference between teacher and student models is often modest while the price difference is substantial.
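The arithmetic is simple enough to check against your own usage. The sketch below uses placeholder prices and a placeholder monthly volume; none of the figures come from a real vendor price list, so substitute your own.

```python
def monthly_cost(tokens_per_month, price_per_million_tokens):
    """Cost of one month's usage at a given price per million tokens."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# All figures below are hypothetical placeholders, not real vendor prices
tokens_per_month = 50_000_000
premium_price = 10.00    # per million tokens, premium tier
distilled_price = 2.00   # per million tokens, distilled tier

premium = monthly_cost(tokens_per_month, premium_price)
distilled = monthly_cost(tokens_per_month, distilled_price)
saving_pct = (premium - distilled) / premium * 100

print(f"premium: {premium:.2f}, distilled: {distilled:.2f}, "
      f"saving: {saving_pct:.0f} per cent")
```

With these assumed numbers the saving is 80 per cent, inside the 70 to 90 per cent range above. The percentage depends only on the ratio between the two prices, while the absolute saving scales with how many tokens you process.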

This shows up in tools you already subscribe to. When you use AI features inside Microsoft 365 or Google Workspace, you are not necessarily getting the largest model in every interaction. You are getting whatever model the vendor judged appropriate for that task at that cost. Distillation is a key reason they can deliver those features at scale.

For businesses choosing model tiers directly, the decision is usually straightforward. Research shows distilled models can retain over 95 per cent of the teacher’s accuracy on language tasks while running considerably faster. For drafting standard communications, summarising reports, or classifying incoming queries, a distilled model will typically serve you well.
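A retention figure like this is something you can measure on your own work rather than take on trust. The sketch below assumes you have a small set of queries with known correct categories and have recorded what each model tier answered. The ten examples and both sets of predictions are invented for illustration.

```python
def accuracy(predictions, answers):
    """Share of predictions that match the known correct answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical test set: ten incoming queries with known categories
answers = ["complaint", "query", "query", "complaint", "other",
           "query", "complaint", "other", "query", "complaint"]

# Hypothetical outputs recorded from each model tier
teacher_preds = ["complaint", "query", "query", "complaint", "query",
                 "query", "complaint", "other", "query", "complaint"]
student_preds = ["complaint", "query", "query", "complaint", "query",
                 "query", "complaint", "other", "query", "query"]

teacher_acc = accuracy(teacher_preds, answers)
student_acc = accuracy(student_preds, answers)
retention = student_acc / teacher_acc

# Items where the two tiers disagree are the ones worth reading by hand
disagreements = [i for i, (t, s) in enumerate(zip(teacher_preds, student_preds))
                 if t != s]

print(f"teacher: {teacher_acc:.0%}, student: {student_acc:.0%}, "
      f"retention: {retention:.0%}, disagreements at: {disagreements}")
```

Ten examples is far too few to draw conclusions from, so treat this as the shape of the check rather than a method. In this invented case the student retains less than 95 per cent of the teacher's accuracy, which would be a prompt to look at the disagreements before committing to the cheaper tier.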

Where the calculus changes is in complex or specialised work. A distilled model trained on broad tasks may fall short in a narrow domain where even the teacher was operating close to its limits. That is the use case where a premium tier earns its cost, and the important thing is making that call deliberately rather than by default.

Where will you actually meet distillation?

You will meet distillation most commonly when a platform offers a cheaper, faster model alongside a more capable one. In late 2023, GPT-3.5 Turbo cost roughly one-tenth the price of GPT-4 per thousand tokens, a price gap of the kind that distillation and related compression techniques help make possible. The same pattern runs through Microsoft 365, Google Workspace, and many CRM systems, where AI features rely on a compressed model rather than the full-sized one.

DeepSeek, a Chinese AI lab, demonstrated in early 2025 how far distillation can stretch. Its efficient models matched much of the performance of far larger competitors by applying distillation aggressively, at a fraction of the compute cost. The episode confirmed that distillation is now a mainstream production strategy across commercial AI development.

For owner-managed businesses, the three most common encounter points are choosing between model tiers within a platform, where standard versus premium typically reflects a distilled versus full model; using embedded AI inside existing software, where the vendor has already made the model choice; and commissioning custom AI tools on open-source foundations such as the LLaMA family, where distillation is often applied to shrink models for deployment on modest infrastructure.

Vendors rarely volunteer which tier you are on or what the trade-offs involve. Asking is reasonable, expected, and increasingly a sign of a mature buyer.

When does the teacher-student gap actually matter?

The teacher-student gap matters when the task demands nuanced reasoning, handles rare edge cases, or carries real consequences if the model gets it wrong. For routine business work, a well-built student model typically performs well enough. In regulated contexts, such as financial advice or legal document review, accuracy standards are higher and any trade-off a distilled model represents needs to be explicitly assessed and documented.

The ICO’s guidance on AI and data protection makes clear that organisations using AI systems involving personal data carry obligations around lawful basis, transparency, and data subject rights. Distillation itself does not change those duties. It changes the cost and performance profile of the tool, while your obligations as controller remain constant.

In financial services, the FCA expects firms to maintain oversight of AI used in decision-making, including understanding the limitations and trade-offs of the models involved. The Bank of England and FCA joint survey on machine learning found model governance frequently weak in practice, particularly when firms adopt vendor AI without interrogating the underlying model choices.

The NCSC’s guidance on secure AI system development raises a further dimension. If a distilled model is deployed on your own infrastructure rather than a cloud provider’s, your responsibilities for endpoint security, patching, and integrity checks expand accordingly. The efficiency gain of a smaller model comes with a corresponding shift in where the security burden sits.

What sits alongside distillation that’s worth knowing?

Distillation often gets confused with fine-tuning, quantisation, and pruning. Fine-tuning trains an existing model further on a specific dataset without shrinking it. Quantisation stores weights in a lower-precision format to reduce memory use. Pruning removes redundant connections from a network. All four make AI more efficient or more specific, but they solve different problems, and the distinction matters when you are questioning a vendor.
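To make the contrast concrete, here is a toy sketch of quantisation and pruning applied to six made-up weights. Real systems work on millions or billions of weights with more sophisticated schemes; the function names, the 8-bit range, and the 50 per cent pruning fraction are all assumptions for illustration.

```python
def quantise_int8(weights):
    """Map floats to whole numbers in [-127, 127] plus one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantise(quantised, scale):
    """Recover approximate floats from the stored whole numbers."""
    return [q * scale for q in quantised]

def prune_smallest(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    drop_count = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:drop_count])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

# Hypothetical weights from one tiny layer
weights = [0.42, -0.03, 0.91, 0.007, -0.58, 0.12]

quantised, scale = quantise_int8(weights)
restored = dequantise(quantised, scale)
pruned = prune_smallest(weights, fraction=0.5)

print("quantised:", quantised)
print("restored: ", [round(w, 3) for w in restored])
print("pruned:   ", pruned)
```

Both techniques modify a model that already exists: quantisation stores each weight more coarsely, and pruning sets some weights to zero. Distillation is different in kind, because it trains a new, smaller model to imitate the original.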

The CMA’s review of foundation model markets found that downstream business users, including owner-managed businesses, lack sufficient transparency about which model is in use and what its limitations are. Expect clearer model-tier labelling to become a standard expectation in AI procurement contracts as regulatory pressure increases.

If a vendor proposes a custom solution involving distillation on your own data, three questions matter above all others. What is the base model and who created it? Was your data used in the distillation process, and what data processing agreements govern that? What is the documented accuracy trade-off compared to the teacher, and can you see evidence on tasks relevant to your work?

You do not need to understand the mathematics behind distillation. You need to know that the trade-off was made consciously, with your interests and obligations represented in that decision. If a vendor cannot answer those questions clearly, that is the information you needed.
