What is knowledge distillation? A plain-English guide

You’re in a demo with an AI vendor. The tool runs on your own servers, with no data leaving your network. You ask what computing power that takes. They say the model is “lighter”, “optimised”, “tailored for on-premise use”. What they usually mean, without saying it, is that someone has distilled a large AI model into a far smaller one that runs cheaply on modest hardware. That process is called knowledge distillation.

What is knowledge distillation?

Knowledge distillation compresses a large AI model into a much smaller one while keeping most of its performance. A big “teacher” model is trained first, then used to guide training of a compact “student” model that learns to approximate its outputs. The result handles most of the cognitive work of the original at a fraction of the computing cost.

The idea was formalised by Geoffrey Hinton and colleagues in a 2015 paper showing that a small network trained on the probability outputs of a large ensemble of neural networks could reach near-equivalent accuracy. Rather than training the small model from scratch on raw data, you train it to mimic the teacher’s behaviour on the same inputs. The student learns not just the right answers but the teacher’s uncertainty and nuance, such as which wrong answers the teacher considers nearly right.
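For readers who want to see the mechanics, here is a minimal sketch of the "soft target" idea, not the paper's full training recipe. The function names, the three example scores and the temperature of 2.0 are all illustrative assumptions. It measures how far a student's softened probabilities sit from the teacher's, so that two students with the same top answer can still score differently.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened probabilities to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical raw scores for three classes, e.g. cat, dog, fox.
teacher = [4.0, 2.5, 0.5]
close_student = [3.8, 2.4, 0.6]  # similar ranking and similar confidence
far_student = [4.0, 0.0, 2.5]    # same top answer, different second guess

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
```

Both students pick the first class, yet `loss_far` comes out larger than `loss_close`, because the second student disagrees with the teacher about which alternative is plausible. In practice this term is usually combined with an ordinary loss on the true labels; that part is left out here.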

The best-known practical result is DistilBERT, published by researchers at Hugging Face in 2019. DistilBERT uses 40 per cent fewer parameters than its teacher model BERT and runs 60 per cent faster, while retaining 97 per cent of BERT’s language understanding on standardised benchmarks. That trade-off, very close to the original at sharply lower cost, is the promise distillation makes, and it’s why the technique has spread from academic research into commercial AI products.

Why does this matter for your business?

For owner-managed businesses using commercial AI tools, knowledge distillation is invisible. Microsoft, Google, Anthropic and others compress and optimise their models before you encounter them. Where distillation matters is the decision to run a model yourself, whether to cut cloud API costs at scale, to keep sensitive data on your own infrastructure, or to achieve the response speeds a client-facing product requires.

Running AI through an external API is simple and suitable for the large majority of businesses at early to mid stages of adoption. API costs become a material question only when you’re running tens of thousands of interactions a month. Data control becomes a concern when you’re processing documents you genuinely cannot route through a US-hosted service, even with a data processing agreement, for example clinical records, legal files, or financial data under specific regulatory obligations.

When either of those pressures applies, a distilled model running on your own server becomes worth evaluating. The Samsung incident in 2023 illustrated the pattern well. Engineers used a public AI service for internal work, sensitive information was inadvertently shared, and the firm moved quickly toward self-hosted options. The solution they explored, smaller models running on internal infrastructure, is precisely where distillation fits.

Where will you actually encounter it?

Knowledge distillation comes up most frequently in vendor conversations and in the open-source AI community. When a vendor describes their tool as “lightweight” or “on-premise-ready”, distillation is commonly part of how that was achieved. Open-source examples include DistilBERT and the smaller variants of Meta’s Llama family, designed to run efficiently on a single server rather than requiring an expensive GPU cluster.

Hugging Face hosts a public library of distilled models. Meta’s Llama 3 family includes an 8-billion-parameter variant aimed at efficient single-GPU deployment, compared to the 70-billion-parameter version that needs substantially more hardware. These are production-grade tools that technical teams at owner-managed and mid-size businesses are actively using in 2026. The CMA’s 2024 review of AI foundation models noted that access to the leading models is concentrated in a small number of firms, and open-source distilled models are part of the broader market response to that concentration.

When you hear a vendor claim their AI runs “locally”, it is worth verifying what that means in practice. Some vendors use distillation and quantisation to shrink a genuinely capable model to local scale. Others use a model that was always small and less capable, which runs locally by default but may not perform at the level you need. Asking for benchmark numbers on accuracy and latency, alongside confirmation of the underlying model, will quickly separate the two.
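If you have technical support, the two numbers worth asking for can also be measured in-house. The sketch below is one simple way to do it, assuming the model can be called as a plain function; `benchmark`, `toy_model` and the three test examples are hypothetical stand-ins, not any vendor's API.

```python
import time

def benchmark(model, examples):
    """Return (accuracy, average seconds per call) for a callable model."""
    correct = 0
    start = time.perf_counter()
    for text, expected in examples:
        if model(text) == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return correct / len(examples), elapsed / len(examples)

def toy_model(text):
    # Hypothetical stand-in for a real model: a crude keyword rule.
    return "positive" if "good" in text else "negative"

# Hypothetical labelled examples; a real test set should come from your own data.
examples = [
    ("good service", "positive"),
    ("slow and rude", "negative"),
    ("not good at all", "negative"),
]

accuracy, latency = benchmark(toy_model, examples)
```

Running the same test set through the vendor's local model and through a larger reference model gives a like-for-like comparison, which is more useful than headline benchmark figures alone.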

When should you ask about this, and when can you ignore it?

Skip this topic if you are still working out whether AI adds value to your workflows. At that stage, you need proof of concept, not infrastructure decisions. Skip it too if your usage is light and your API costs are modest. Knowledge distillation becomes worth exploring only once you are committing to serious integration and genuinely need to control where the model runs.

The conditions that tip the balance are reasonably specific. You need a data-sensitivity constraint that prevents external API use, a latency requirement that rules out remote models, or a usage volume where cloud API costs exceed the combined cost of building and running your own model over a 12 to 36-month horizon.
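The cost condition comes down to simple arithmetic. The sketch below shows one way to frame it, with every figure hypothetical and the function name invented for illustration: divide the up-front build cost by the monthly saving, and check whether the answer falls inside the horizon you care about.

```python
def months_to_break_even(monthly_api_cost, build_cost, monthly_running_cost):
    """Months until self-hosting becomes cheaper than the API, or None if never."""
    monthly_saving = monthly_api_cost - monthly_running_cost
    if monthly_saving <= 0:
        return None  # running your own model costs as much as the API, or more
    return build_cost / monthly_saving

# All figures are hypothetical, in pounds.
months = months_to_break_even(
    monthly_api_cost=6000,
    build_cost=90000,
    monthly_running_cost=2500,
)
within_horizon = months is not None and months <= 36
```

With these made-up numbers the build cost is recovered in roughly 26 months, inside a 36-month horizon but well outside a 12-month one. A real comparison should also account for staff time and hardware replacement, which this sketch folds into the running cost.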

Without in-house or partner technical capability, none of those conditions can be acted on. Distillation and model deployment require machine learning engineers who can run training jobs, evaluate output quality, and maintain the system over time. For a business without that access, the operational cost of self-hosting tends to outweigh the benefit. The right response to hearing about distillation in a vendor conversation is to ask what model it is distilled from, what accuracy benchmarks it reaches, and how the student model is maintained as the field evolves.

How does knowledge distillation connect to other AI concepts?

Knowledge distillation is one of three model-compression approaches you’re likely to encounter when smaller, faster AI models come up. The other two are quantisation, which reduces the numerical precision of model weights to cut memory requirements, and pruning, which removes parameters that contribute least to performance. Knowing the difference helps you ask better questions of vendors and make cleaner decisions about your own infrastructure.
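To make pruning concrete, here is a minimal sketch of one common variant, magnitude pruning, which treats the weights closest to zero as the ones contributing least. That is a simplifying assumption, and the function name and weight values are illustrative only.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * fraction)
    # Indices ordered from smallest to largest absolute weight.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

# Hypothetical weights from one small layer.
weights = [0.82, -0.03, 0.41, 0.007, -0.65, 0.12, -0.009, 0.3]
pruned = prune_by_magnitude(weights, fraction=0.5)
```

Half the weights become zero while the four largest survive unchanged. Real pruning is usually followed by further training so the model can recover lost accuracy.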

Quantisation is simpler and cheaper to apply than distillation. You don’t retrain the model; you convert it to lower-precision arithmetic, typically from 16-bit or 32-bit numbers down to 8-bit or 4-bit. That alone can cut memory requirements several times over with modest accuracy loss. Modern file formats such as GGUF have made this accessible to technical teams without deep machine learning expertise, and it’s one reason smaller open-source models can run on a powerful laptop.
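The sketch below illustrates the principle with the simplest scheme, a single shared scale factor that maps each weight to an 8-bit integer. Production tools use more refined methods; the function names and weights here are hypothetical, and the code assumes at least one weight is non-zero.

```python
def quantise_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127  # assumes a non-zero weight exists
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantise(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]

# Hypothetical weights from one small layer.
weights = [0.82, -0.03, 0.41, 0.007, -0.65, 0.12]
codes, scale = quantise_int8(weights)
restored = dequantise(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

bytes_fp32 = len(weights) * 4  # 32-bit floats take 4 bytes per weight
bytes_int8 = len(weights) * 1  # 8-bit integers take 1 byte per weight
```

Storage drops to a quarter, and each restored weight differs from the original by at most half of `scale`. That small rounding error is the source of the modest accuracy loss.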

Fine-tuning is related but distinct. Where distillation transfers general capability from teacher to student, fine-tuning adapts an existing model to perform better on a specific domain using your own data. The two work well together. Fine-tune a distilled model and you end up with one that is both smaller and domain-specific.

None of these technical choices changes your regulatory obligations. The ICO’s guidance on AI and data protection is clear that organisations remain data controllers for personal data fed into AI systems, regardless of model architecture. The EU AI Act classifies systems by use case and deployment context, not compression technique. If you’re training or fine-tuning on personal data, a Data Protection Impact Assessment is likely required, regardless of whether the model was distilled, quantised, or full-size.

