What is AI model distillation? Teacher and student models explained

TL;DR

Knowledge distillation is the process by which AI vendors take a large, expensive model, called the teacher, and train a smaller, cheaper one, called the student, to match its behaviour. The technique is why the same AI platform often offers a premium and a standard option at very different prices. Understanding it helps you choose the right model tier, ask better vendor questions, and know when the cheaper option is genuinely good enough.

Key takeaways

- Knowledge distillation trains a smaller student model to mimic a larger teacher model's outputs, producing a model that is cheaper and faster to run while retaining much of the original accuracy.
- Moving from a premium to a distilled model tier can cut AI inference costs by 70 to 90 per cent, which is why platforms offer standard and premium model options at very different prices.
- Everyday business tasks such as drafting, summarising, and classifying are generally well served by distilled models; high-stakes or complex reasoning tasks may need the more capable teacher model.
- Distillation does not alter your data protection or regulatory obligations; the ICO expects you to understand and govern how any AI system, distilled or not, uses personal data.
- Treat the question "is this a distilled model?" as standard due diligence in any AI procurement discussion, alongside questions about accuracy trade-offs and what happens to your data.

A supplier offers you an AI-assisted tool for processing client documents or handling incoming queries. When you ask which model it uses, they mention a “lite” version of one of the main platforms. You take their word for it.

Three months later, the outputs feel inconsistent. Some summaries are sharp; others miss obvious context. When you go back to the vendor, they reference model tiers and accuracy trade-offs. That is when you realise you never asked the question that mattered: what does “lite” actually mean, and who made that trade-off on your behalf?

The answer, in almost every case, traces to a technique called knowledge distillation. Understanding it will not make you an AI engineer. But it will give you the vocabulary to ask better questions of vendors, choose model tiers deliberately, and know when the cheaper option is genuinely good enough.

What is AI distillation?

Knowledge distillation is when a large AI model, called the teacher, trains a smaller, cheaper model, called the student, to behave in a similar way. The teacher shares its full distribution of confidence across all possible responses, not just a correct answer. That richer signal allows the student to retain much of the teacher’s accuracy at a fraction of the running cost.

The process works in two broad stages. First, the teacher, which may have hundreds of billions of parameters and cost millions in compute to build, processes a large set of examples. For each one, it produces what researchers call “soft targets”: a probability distribution showing not just which answer it prefers, but how its confidence is spread across all the possibilities. A model asked whether an email is a complaint might assign 82 per cent confidence to yes, 13 per cent to uncertain, and 5 per cent to no. That nuance is the learning signal.
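For readers who want to see what a soft target looks like in practice, here is a minimal Python sketch. It assumes the teacher's raw scores are converted to probabilities with a softmax function, which is the usual approach. The scores themselves are hypothetical, chosen only so that the result reproduces the 82, 13 and 5 per cent split described above. The `temperature` setting comes from the Hinton, Vinyals, and Dean paper; the value of 2.0 used here is an arbitrary illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["complaint", "uncertain", "not a complaint"]

# Hypothetical teacher scores, chosen so that temperature 1 gives
# the 82 / 13 / 5 per cent split used in the text.
teacher_logits = [math.log(0.82), math.log(0.13), math.log(0.05)]

soft_targets = softmax(teacher_logits, temperature=1.0)
softened = softmax(teacher_logits, temperature=2.0)

for label, p1, p2 in zip(LABELS, soft_targets, softened):
    print(f"{label:16s} T=1: {p1:.2f}   T=2: {p2:.2f}")
```

Raising the temperature flattens the distribution, so the student sees more of how the teacher ranks the less likely answers rather than only its top choice.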

The student model then trains on those soft targets alongside the original labelled examples. Because soft targets carry far more information than simple right-or-wrong labels, the student can achieve high accuracy while being significantly smaller.
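The way the two learning signals are combined can be sketched as a single loss function. The version below is one common formulation, not necessarily what any particular vendor uses: a weighted blend of a soft-target term (how far the student's distribution is from the teacher's) and a hard-label term (whether the student picks the labelled answer). The function name, the weighting `alpha`, the temperature, and all the scores are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term with an ordinary hard-label term.

    alpha and temperature are illustrative settings, not recommendations.
    """
    # Soft term: KL divergence from the teacher's softened distribution
    # to the student's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)

    # Hard term: cross-entropy against the labelled correct answer.
    hard = -math.log(softmax(student_logits)[true_index])

    # Scaling the soft term by T^2 keeps its gradients on a comparable
    # scale to the hard term, as suggested by Hinton et al. (2015).
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard

teacher = [2.8, 0.9, 0.0]        # hypothetical teacher scores
close_student = [2.6, 1.0, 0.1]  # student that nearly agrees with the teacher
far_student = [0.2, 0.1, 2.5]    # student that disagrees

loss_close = distillation_loss(close_student, teacher, true_index=0)
loss_far = distillation_loss(far_student, teacher, true_index=0)
print(f"close student loss: {loss_close:.3f}")
print(f"far student loss:   {loss_far:.3f}")
```

Training nudges the student's weights to push this number down, which pulls its answers towards both the teacher's distribution and the labelled examples.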

Hinton, Vinyals, and Dean published the foundational version of this technique in 2015. Since then it has become one of the primary routes AI labs take to build cheap, deployable models from their most capable research systems. DistilBERT, one well-documented example, ran 60 per cent faster with 40 per cent fewer parameters while retaining about 97 per cent of the original BERT model's language-understanding performance.

Why does it matter for your business?

The main reason distillation matters for owner-managed businesses is cost. Moving from a premium AI model tier to a cheaper distilled version can cut inference costs by 70 to 90 per cent per million tokens. For everyday tasks such as drafting, summarising, and classifying, the quality difference between teacher and student models is often modest while the price difference is substantial.
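The arithmetic behind that saving is simple enough to check for your own usage. The sketch below uses entirely hypothetical prices and volumes, so substitute the figures from your vendor's price list. The function and variable names are illustrative only.

```python
def monthly_cost(tokens_per_month, price_per_million_tokens):
    """Cost of a month's usage at a flat per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens

# All figures below are hypothetical placeholders, not real vendor prices.
TOKENS_PER_MONTH = 40_000_000
PREMIUM_PRICE = 10.00   # per million tokens
DISTILLED_PRICE = 1.50  # per million tokens

premium = monthly_cost(TOKENS_PER_MONTH, PREMIUM_PRICE)
distilled = monthly_cost(TOKENS_PER_MONTH, DISTILLED_PRICE)
saving_pct = (premium - distilled) / premium * 100

print(f"premium:   {premium:.2f}")
print(f"distilled: {distilled:.2f}")
print(f"saving:    {saving_pct:.0f}%")
```

With these placeholder figures the saving works out at 85 per cent, inside the 70 to 90 per cent range quoted above. Real pricing often charges input and output tokens at different rates, which this sketch ignores.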

This shows up in tools you already subscribe to. When you use AI features inside Microsoft 365 or Google Workspace, you are not necessarily getting the largest model in every interaction. You are getting whatever model the vendor judged appropriate for that task at that cost. Distillation is a key reason they can deliver those features at scale.

For businesses choosing model tiers directly, the decision is usually straightforward. Research shows distilled models can retain over 95 per cent of the teacher’s accuracy on language tasks while running considerably faster. For drafting standard communications, summarising reports, or classifying incoming queries, a distilled model will typically serve you well.
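A retention figure like that is easy to reproduce on a small sample of your own work. The sketch below assumes you have a handful of queries with known correct labels and the answers each model tier gave. All the data is made up for illustration, and ten examples is far too few for a real decision.

```python
def accuracy(predictions, labels):
    """Share of predictions that match the known correct labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Hypothetical results on ten labelled incoming messages.
labels = ["complaint", "query", "order", "query", "complaint",
          "order", "query", "order", "complaint", "query"]
teacher_out = ["complaint", "query", "order", "query", "complaint",
               "order", "query", "order", "complaint", "order"]
student_out = ["complaint", "query", "order", "query", "query",
               "order", "query", "order", "complaint", "order"]

teacher_acc = accuracy(teacher_out, labels)
student_acc = accuracy(student_out, labels)
retention = student_acc / teacher_acc  # share of teacher accuracy the student keeps

print(f"teacher accuracy: {teacher_acc:.0%}")
print(f"student accuracy: {student_acc:.0%}")
print(f"retention:        {retention:.0%}")
```

Asking a vendor for this kind of comparison on examples drawn from your own documents tells you more than a headline benchmark figure.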

Where the calculus changes is in complex or specialised work. A distilled model trained on broad tasks may fall short in a narrow domain where even the teacher was operating close to its limits. That is the use case where a premium tier earns its cost, and the important thing is making that call deliberately rather than by default.

Where will you actually meet distillation?

You will meet distillation most commonly when a platform offers a cheaper, faster model alongside a more capable one. In late 2023, GPT-3.5 Turbo cost roughly one-tenth the price of GPT-4 per thousand tokens. OpenAI has not published how either model was trained, so that gap is best read as an illustration of how much cheaper a smaller model is to run rather than as a confirmed case of distillation. The same pattern runs through Microsoft 365, Google Workspace, and many CRM systems, where AI features often rely on a compressed model rather than the full-sized one.

DeepSeek, a Chinese AI lab, demonstrated in early 2025 how far distillation can stretch. Its efficient models matched much of the performance of far larger competitors by applying distillation aggressively, at a fraction of the compute cost. The episode confirmed that distillation is now a mainstream production strategy across commercial AI development.

For owner-managed businesses, the three most common encounter points are choosing between model tiers within a platform, where standard versus premium typically reflects a distilled versus full model; using embedded AI inside existing software, where the vendor has already made the model choice; and commissioning custom AI tools on open-source foundations such as the LLaMA family, where distillation is often applied to shrink models for deployment on modest infrastructure.

Vendors rarely volunteer which tier you are on or what the trade-offs involve. Asking is reasonable, expected, and increasingly a sign of a mature buyer.

When does the teacher-student gap actually matter?

The teacher-student gap matters when the task demands nuanced reasoning, handles rare edge cases, or carries real consequences if the model gets it wrong. For routine business work, a well-built student model typically performs well enough. In regulated contexts, such as financial advice or legal document review, accuracy standards are higher and any trade-off a distilled model represents needs to be explicitly assessed and documented.

The ICO’s guidance on AI and data protection makes clear that organisations using AI systems involving personal data carry obligations around lawful basis, transparency, and data subject rights. Distillation itself does not change those duties. It changes the cost and performance profile of the tool, while your obligations as controller remain constant.

In financial services, the FCA expects firms to maintain oversight of AI used in decision-making, including understanding the limitations and trade-offs of the models involved. The Bank of England and FCA joint survey on machine learning found model governance frequently weak in practice, particularly when firms adopt vendor AI without interrogating the underlying model choices.

The NCSC’s guidance on secure AI system development raises a further dimension: if a distilled model is deployed on your own infrastructure rather than a cloud provider’s, your responsibilities for endpoint security, patching, and integrity checks expand accordingly. The efficiency gain of a smaller model comes with a corresponding shift in where the security burden sits.

What sits alongside distillation that’s worth knowing?

Distillation often gets confused with three related techniques: fine-tuning, quantisation, and pruning. Fine-tuning trains an existing model further on a specific dataset without shrinking it. Quantisation stores weights in lower-precision format to reduce memory use. Pruning removes redundant connections from a network. All four make AI more efficient or more specific, but they solve different problems and the distinction matters when you are questioning a vendor.
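Quantisation and pruning are the two easiest of these to show in miniature, because both act directly on a model's stored numbers rather than retraining it. The toy Python sketch below uses eight made-up weights. The function names and the simple single-scale scheme are assumptions for illustration, and production tools are considerably more sophisticated.

```python
def quantise_int8(weights):
    """Map floats to whole numbers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantise(q_weights, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in q_weights]

def prune_smallest(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

weights = [0.82, -0.03, 0.41, 0.002, -0.67, 0.09, -0.25, 0.015]  # toy values

q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
pruned = prune_smallest(weights, fraction=0.5)

print("quantised:", q)
print("restored: ", [round(r, 3) for r in restored])
print("pruned:   ", pruned)
```

Quantisation keeps every weight but stores it less precisely. Pruning keeps full precision but discards the weights that contribute least. Distillation, by contrast, trains a separate, smaller model.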

The CMA’s initial review of foundation models found transparency insufficient for downstream business users, including owner-managed businesses, when it comes to understanding which model is in use and what its limitations are. Expect clearer model-tier labelling to become a standard expectation in AI procurement contracts as regulatory pressure increases.

If a vendor proposes a custom solution involving distillation on your own data, three questions matter above all others. What is the base model and who created it? Was your data used in the distillation process, and what data processing agreements govern that? What is the documented accuracy trade-off compared to the teacher, and can you see evidence on tasks relevant to your work?

You do not need to understand the mathematics behind distillation. You need to know that the trade-off was made consciously, with your interests and obligations represented in that decision. If a vendor cannot answer those questions clearly, that is the information you needed.

Sources

- Hinton G, Vinyals O, Dean J (2015). Distilling the Knowledge in a Neural Network. Foundational paper establishing the teacher-student distillation framework used by commercial AI labs. https://arxiv.org/abs/1503.02531
- Sanh V et al. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Documents 60% faster inference with 40% fewer parameters at modest accuracy cost. https://arxiv.org/abs/1910.01108
- Jiao X et al. (2020). TinyBERT: Distilling BERT for Natural Language Understanding. Further evidence of accuracy retention through distillation at reduced scale. https://arxiv.org/abs/1909.10351
- IBM (2024). What is knowledge distillation? Accessible reference on the distillation process including soft targets and the training steps involved. https://www.ibm.com/think/topics/knowledge-distillation
- ICO. Guidance on AI and data protection. Sets out UK organisations' obligations when using AI systems involving personal data, including vendor-built models. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/
- NCSC (2023). Guidelines for Secure AI System Development. Covers supply-chain risk and edge-deployment security obligations relevant to businesses deploying distilled models on their own infrastructure. https://www.ncsc.gov.uk/collection/guidelines-secure-ai-system-development
- CMA (2023). AI Foundation Models: Initial report. Documents transparency gaps for downstream business users in the foundation model market. https://www.gov.uk/government/publications/ai-foundation-models-initial-report
- OpenAI (2023). GPT-4 Technical Report. Documents model tiers and performance characteristics relevant to understanding premium versus standard AI model differences. https://arxiv.org/abs/2303.08774
- Bank of England and FCA (2022). Machine learning in UK financial services. Joint survey findings on model governance weaknesses when firms adopt vendor AI. https://www.bankofengland.co.uk/report/2022/machine-learning-in-uk-financial-services
- GaussianWaves (2025). Model distillation explained, and how DeepSeek uses the technique. Documents DeepSeek's use of aggressive distillation to match larger model performance. https://www.gaussianwaves.com/2025/02/model-distillation-explained-how-deepseek-leverages-the-technique-for-ai-success/

Frequently asked questions

What is the difference between a teacher model and a student model in AI?

The teacher is a large, high-performing AI model. The student is a smaller model trained to mimic the teacher's outputs. Rather than just learning from labelled data, the student learns from the teacher's full probability distribution across possible answers, which gives it richer information to work from. The result is a model that runs faster and costs less to operate, with much of the teacher's accuracy preserved.

Do I need to understand distillation to use AI tools in my business?

No technical knowledge is needed. Distillation is a vendor decision, and many owner-managed businesses encounter it indirectly through the model tiers in platforms they already use. What matters is knowing to ask which tier you are on, what the accuracy trade-off looks like on tasks relevant to your business, and whether your data was involved in any training or fine-tuning process.

If a vendor uses a distilled model, does that affect my data protection obligations?

Your data protection obligations remain the same regardless of whether the model is distilled. The ICO's AI guidance makes clear that organisations are responsible for how personal data is used across any AI system, including training and inference. If your data was used to build or adapt the model, you may have obligations around lawful basis and data subject rights. Check the vendor's data processing agreement carefully.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

