How distillation works during AI model training

TL;DR

Distillation is how a big, expensive AI model trains a smaller, cheaper one to behave like it. For a UK owner-operator that means lower per-query cost, the option to host a model on your own infrastructure, and a cleaner data-protection story once you stop sending live customer data to a hyperscaler. It is not a silver bullet for low-volume or highly specialised work.

Key takeaways

- Distillation is a training method where a smaller "student" model learns to imitate a larger "teacher" model, often inheriting most of its capability at a fraction of the running cost.
- In Google's "Distilling Step-by-Step" research, a 770M-parameter student outperformed a 540B-parameter teacher on certain tasks while using only 80% of the available data.
- You will rarely see the word "distillation" on a vendor slide; you will see "Turbo", "Lite", "small", "edge-ready", or "on-device", all of which usually mean a distilled variant.
- The ICO's AI and data protection guidance still applies when you use a teacher model to label your data; lawful basis, international transfer rules, and DPIA triggers do not disappear because the technique is technical.
- Distillation makes most sense for high-volume narrow tasks like email triage or document routing, and least sense for ad-hoc expert reasoning or low-volume work where a hosted flagship is cheaper to run.

A founder I spoke to last month had been quoted two prices for the same chatbot. One was the flagship model at roughly ten pence per query. The other was the vendor’s “Turbo” variant at about one penny. Same vendor, near-identical sales deck, a tenfold difference in price. He wanted to know whether the cheaper one was secretly broken. The honest answer is that the cheaper one is almost certainly distilled, and once you understand what that means, the pricing makes more sense than the slides do.

This post is the plain-English version of that explanation. No maths, no parameter counts unless they earn their place, just what the word actually means, when it works in your favour, and the handful of questions to put to a vendor before you sign.

What is distillation in AI training?

Distillation is a training method where a smaller AI model, the student, learns to behave like a bigger one, the teacher. You feed the teacher inputs, record its answers, and train the student to produce the same answers from the same inputs. IBM describes it as transferring the teacher’s learning, including reasoning steps where possible. The student ends up cheaper to run while keeping much of what made the teacher useful.
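For readers who like to see the mechanics, here is a toy Python sketch of that loop: feed the teacher inputs, record its answers, fit the student to them. Everything in it is an assumption made for illustration. The `teacher` function is a hypothetical keyword rule standing in for an expensive hosted model, the student is a simple word-count voter rather than a neural network, and the six example emails are invented.

```python
from collections import Counter, defaultdict

def teacher(text: str) -> str:
    """Hypothetical stand-in for a large hosted model (not a real API)."""
    billing_words = ("invoice", "refund", "payment")
    return "billing" if any(w in text.lower() for w in billing_words) else "general"

def train_student(inputs, labels):
    """Count how often each word appears under each teacher-assigned label."""
    counts = defaultdict(Counter)
    for text, label in zip(inputs, labels):
        for word in text.lower().split():
            counts[word][label] += 1
    return counts

def student_predict(counts, text, default="general"):
    """Let each known word vote for the labels it was seen with."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else default

# Invented historic inputs; in practice these would be your own records.
historic = [
    "please resend the invoice for march",
    "refund has not arrived yet",
    "payment failed on my card",
    "what are your opening hours",
    "can I book a site visit",
    "where is your office",
]

teacher_labels = [teacher(t) for t in historic]    # step 1: record the teacher's answers
student = train_student(historic, teacher_labels)  # step 2: fit the student to them

print(student_predict(student, "my invoice is wrong"))  # prints "billing"
```

The student never sees the teacher's rule, only its answers, which is the point: the cheap model learns the behaviour, not the internals.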

The metaphor that holds up is an apprentice watching a senior colleague work. The apprentice does not have the same years behind them, and on the hardest cases they will fall short. On the routine work, which is the majority of the work, they get there for a fraction of the time and cost. That is the trade you are buying when you buy a distilled model.

You will encounter at least three variants of this in the wild. Soft-target distillation, where the student copies the teacher’s probability scores rather than its top answer. Step-by-step distillation, where the teacher produces both an answer and its reasoning and the student learns to do both. And self-distillation, where shallow layers of one model learn from deeper layers of the same model during training and the deeper layers are then thrown away for deployment. The naming is technical; the underlying move is the same.
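To make the first variant concrete, here is a minimal Python sketch of a soft-target loss. It is an illustration under stated assumptions, not any vendor's training code: the raw scores (logits) are invented, the temperature of 2.0 is an arbitrary choice, and production recipes usually combine this term with an ordinary loss on the correct answers.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened probabilities."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented scores over three possible answers.
teacher_scores = [4.0, 2.5, 0.5]
close_student = [3.8, 2.6, 0.4]    # similar ranking and similar confidence
sloppy_student = [4.0, 0.5, 2.5]   # same top answer, wrong view of the runners-up

print(soft_target_loss(teacher_scores, close_student))
print(soft_target_loss(teacher_scores, sloppy_student))
```

Both students pick the same top answer as the teacher, yet the second one scores a higher loss because it disagrees about the runners-up. That extra signal is what copying probability scores buys you over copying only the final answer.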

Why does distillation matter for your business?

It matters because it is the mechanism behind almost every “cheaper version of the same thing” that your AI vendor is selling you. The big models are expensive to run; the distilled versions run on commodity hardware at a fraction of the cost. SabrePC’s vendor-side analysis lists the practical benefits: smaller model size, faster inference, lower latency, reduced cloud and hardware spend, and the ability to deploy on resource-constrained devices.

For an owner-operator, three things follow. Your per-query cost on routine tasks can drop by an order of magnitude if you move from a flagship to a distilled variant. You get the option, depending on the model and the licence, to run the smaller model on your own infrastructure rather than sending every query to a hyperscaler. And you get a cleaner data-protection story, because once your distilled model is trained you can stop sending live customer data to the teacher for labelling.

The flip side is real. A distilled model is not the same as the teacher. On routine work the gap is small; on the harder cases the gap shows up. The trade only pays back when you have enough volume of routine work to justify the cost of training and maintaining the student in the first place. For an SME doing fifty queries a week, the trade does not pay back. For one doing fifty thousand, it usually does.
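The arithmetic behind that claim is simple enough to sketch in a few lines of Python. The per-query prices are the rough figures from the founder's two quotes earlier in this post; the £20,000 yearly cost of training and maintaining a student is a purely hypothetical number, so substitute your own.

```python
import math

FLAGSHIP_PENCE = 10.0            # rough flagship quote from earlier in the post
DISTILLED_PENCE = 1.0            # rough "Turbo" quote from earlier in the post
STUDENT_FIXED_COST_GBP = 20_000  # hypothetical yearly training + maintenance cost
WEEKS_PER_YEAR = 52

def annual_saving_gbp(queries_per_week):
    """Yearly saving from paying the distilled rate instead of the flagship rate."""
    saving_pence = queries_per_week * WEEKS_PER_YEAR * (FLAGSHIP_PENCE - DISTILLED_PENCE)
    return saving_pence / 100

def break_even_queries_per_week():
    """Smallest weekly volume at which the saving covers the fixed cost."""
    pence_saved_per_weekly_query = WEEKS_PER_YEAR * (FLAGSHIP_PENCE - DISTILLED_PENCE)
    return math.ceil(STUDENT_FIXED_COST_GBP * 100 / pence_saved_per_weekly_query)

for volume in (50, 50_000):
    saving = annual_saving_gbp(volume)
    print(f"{volume} queries/week saves £{saving:,.0f} a year; "
          f"pays back: {saving > STUDENT_FIXED_COST_GBP}")

print("Break-even volume:", break_even_queries_per_week(), "queries/week")
```

On these assumptions fifty queries a week saves £234 a year and fifty thousand saves £234,000, with break-even at about 4,274 queries a week. Change the fixed cost and the break-even moves in proportion, which is why the number worth pinning down first is your own routine volume.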

Where will you actually meet it?

You will rarely see “distillation” on a sales slide. The vendor language is different: “Turbo” or “Lite” variants, “small” or “domain-specific” models trained from a foundation model, “edge-ready” or “on-device” tooling, and “fine-tuned for your task” packages where the underlying base is a distilled student. These are usually distillation under another name. OpenAI’s Turbo variants are widely assumed to be distilled from larger models, and much of the market has settled into similar language.

You will also meet it in the build-versus-buy conversation with any consultancy proposing a custom AI tool. The economics often only work if a large hosted model is used as a teacher to label your historic data, then a smaller open-source or in-house student takes over the live workload. If a vendor is quoting you a price that seems unusually low for a custom-trained model, the explanation is almost always distillation plus a strong base. That is worth knowing so you can ask the right questions about accuracy, licence, and what happens when the teacher model is upgraded.

A third place is regulator-facing documentation. The EU AI Act now imposes specific obligations on general-purpose AI models and on systems built from them, and the CMA in the UK is watching foundation-model concentration closely. If your vendor’s small model is derived from a GPAI provider, those obligations can flow through to you as the deployer. The word may not appear in your contract, but the supply chain still runs through it.

When to ask vs when to ignore

Ask hard when you are buying anything that will process customer data at volume, when you are weighing an on-premise deployment, when a quote looks too cheap to make sense, or when you have a regulated activity where model risk governance already applies. In those cases the distinction between teacher and student changes your data-flow map, your contractual exposure, and what you put in your DPIA.

The questions are not technical. Is this model distilled, and from what teacher? What accuracy on our use-case have you measured against the teacher? Can we run it on our infrastructure or only yours? What happens when the teacher is upgraded by the upstream provider? Are there licence terms that limit what we can do with the outputs? Five questions, all of which a competent vendor can answer in a meeting.

Ignore the question when the volume is low, the workload is varied and ad-hoc, or the value of any single query is high enough that paying the flagship rate per call is the rational choice. A small firm running a few dozen AI-drafted contracts a month does not need to know whether the model is distilled. The training and maintenance overhead of a custom student does not pay back at that volume, and a well-validated flagship is the cleaner choice. Distillation is a high-volume tool. The reasons to care scale with how much routine AI work your business is doing.

Fine-tuning is the close cousin and the one people most often confuse with distillation. Fine-tuning takes a base model and trains it further on your data so it gets better at your specific task. Distillation takes a big model’s behaviour and compresses it into a smaller body. The two are often used together, a distilled student that is then fine-tuned on your labelled data, but they are not the same thing.

Quantisation is the other compression technique you will hear about. Where distillation creates a smaller architecture, quantisation keeps the architecture and reduces the precision of the numbers inside it, which speeds it up and shrinks its memory footprint. A model can be both distilled and quantised; many edge deployments are.
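A minimal Python sketch shows what "reducing the precision of the numbers" means in practice. It assumes the simplest scheme, symmetric 8-bit quantisation with one shared scale, and uses five invented weights; real toolchains use more refined variants.

```python
def quantise_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all zero
    return [round(w / scale) for w in weights], scale

def dequantise(quantised, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantised]

# Invented weights standing in for a slice of a model.
weights = [0.82, -0.41, 0.05, -1.27, 0.33]

quantised, scale = quantise_int8(weights)
restored = dequantise(quantised, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantised)    # small integers, one byte each instead of four or eight
print(worst_error)  # never more than half of one scale step
```

The list keeps the same length and order, so the architecture is unchanged; only the storage per number shrinks, at the price of a small rounding error.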

The piece worth holding in your head is that “small model” covers several different decisions. Distillation, quantisation, fine-tuning, and base model selection are all separate choices made by your vendor or your in-house team. When the result is good you should know which of those choices got you there. When it is poor you should know which of them to revisit. Asking the question is what separates an informed buyer from one who is hoping the cheap option works.

If you want to think this through against your own use-case rather than in the abstract, book a conversation. The right answer depends on volume, data sensitivity, and the regulatory frame you sit inside; an hour is usually enough to know whether distillation is something you should be paying attention to or politely ignoring.

Sources

- IBM (2024). What is Knowledge Distillation? Enterprise-oriented explanation of how a student model inherits a teacher model's learning including reasoning steps. https://www.ibm.com/think/topics/knowledge-distillation
- Snorkel AI (2024). LLM distillation demystified, a complete guide. Industry definition of LLM distillation and how it is used to label data for a smaller model. https://snorkel.ai/blog/llm-distillation-demystified-a-complete-guide/
- Google Research (2023). Distilling Step-by-Step, outperforming larger language models with less training data and smaller model sizes. Primary research showing a 770M student beating standard fine-tuning at far smaller dataset sizes. https://research.google/blog/distilling-step-by-step-outperforming-larger-language-models-with-less-training-data-and-smaller-model-sizes/
- Hugging Face (2024). Knowledge distillation, Kseniase. Practical recipe for soft-target distillation and feature-based distillation in production transformer models. https://huggingface.co/blog/Kseniase/kd
- SabrePC (2024). Distillation in LLMs, traditional AI and machine learning. Vendor-side view of the cost, latency, and hardware benefits of distilled models for production. https://www.sabrepc.com/blog/deep-learning-and-ai/distillation-in-llms-traditional-ai-and-machine-learning
- Information Commissioner's Office. Guidance on AI and data protection. UK GDPR principles for AI training, including lawful basis, purpose limitation, data minimisation, and DPIA triggers. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/ai-and-data-protection/
- NCSC. Machine learning security principles. UK guidance on securing training pipelines, protecting model IP, and risks of edge-deployed and compressed models. https://www.ncsc.gov.uk/collection/machine-learning
- Competition and Markets Authority (2023). AI Foundation Models, initial report. UK regulator's view on concentration and lock-in risk in the foundation model market. https://www.gov.uk/government/publications/ai-foundation-models-initial-report
- FCA, PRA, Bank of England (2022). Discussion Paper DP5/22, AI and machine learning. UK financial regulators' expectations on data governance, model risk, and accountability for AI systems. https://www.fca.org.uk/publication/discussion/dp5-22.pdf
- EU AI Act (2024). Regulation 2024/1689. Obligations on general-purpose AI models and high-risk systems, with documentation and transparency duties that flow through to distilled derivatives. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

Frequently asked questions

Is a distilled AI model less accurate than the original?

Usually yes, but not always by much, and not on every task. Vendors like SabrePC and Hugging Face publish accuracy comparisons showing distilled models keep most of the teacher's capability on routine tasks while losing some performance on specialised reasoning. Google's "Distilling Step-by-Step" research even shows smaller students outperforming much larger teachers on certain benchmarks. The honest answer is: test on your own data before you commit, and treat any vendor accuracy claim as a starting point for your own evaluation.

Do I need permission to distil a commercial model like ChatGPT or Claude?

Check the provider's terms before you start. Some providers explicitly restrict using their outputs to train competing models, others permit it for internal use only, and the terms change frequently. OpenAI's terms of use have evolved several times on this point. If you are paying a teacher model to label data that will train a smaller in-house student, get written confirmation from the provider or pick a model whose terms clearly allow the use. Treat this as a contractual question, not just a technical one.

Does the ICO treat distillation differently from other AI training?

No, the principles are the same. The ICO's AI and data protection guidance applies to any use of personal data for model training, distilled or otherwise. You need a lawful basis, you need to respect purpose limitation, you need to minimise the data you process, and you almost certainly need a DPIA if the training is high-risk. The wrinkle with distillation is the data flow: sending personal data to a third-party teacher model for labelling counts as processing, and if that model is hosted outside the UK or EEA, it is also an international transfer under Chapter V UK GDPR.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30-minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.
