A founder I spoke to last month had been quoted two prices for the same chatbot. One was the flagship model at roughly ten pence per query. The other was the vendor’s “Turbo” variant at about one penny. Same vendor, near-identical sales deck, ten times the cost. He wanted to know whether the cheaper one was secretly broken. The honest answer is that the cheaper one is almost certainly distilled, and once you understand what that means, the pricing makes more sense than the slides do.
This post is the plain-English version of that explanation. No maths, no parameter counts unless they earn their place, just what the word actually means, when it works in your favour, and the handful of questions to put to a vendor before you sign.
What is distillation in AI training?
Distillation is a training method where a smaller AI model, the student, learns to behave like a bigger one, the teacher. You feed the teacher inputs, record its answers, and train the student to produce the same answers from the same inputs. IBM describes it as transferring the teacher’s learning, including reasoning steps where possible. The student ends up cheaper to run while keeping much of what made the teacher useful.
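For readers who like to see the moving parts, here is a minimal sketch of the "feed the teacher inputs, record its answers" step. The `toy_teacher` function is a hypothetical stand-in for a call to a large hosted model, and the prompts are invented for illustration; a real pipeline would then train a student on the recorded pairs.

```python
from typing import Callable, List, Tuple


def build_distillation_set(
    teacher: Callable[[str], str], prompts: List[str]
) -> List[Tuple[str, str]]:
    """Record the teacher's answer for each prompt.

    The (prompt, answer) pairs become the student's training data.
    """
    return [(prompt, teacher(prompt)) for prompt in prompts]


def toy_teacher(prompt: str) -> str:
    # Hypothetical stand-in for a large model: labels a customer message.
    return "refund" if "money back" in prompt.lower() else "other"


pairs = build_distillation_set(
    toy_teacher,
    ["Can I get my money back?", "Where is your office?"],
)
for prompt, answer in pairs:
    print(f"{prompt!r} -> {answer!r}")
```

The point of the sketch is that the student never sees the teacher's internals, only its recorded behaviour on your inputs.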
The metaphor that holds up is an apprentice watching a senior colleague work. The apprentice does not have the same years behind them, and on the hardest cases they will fall short. On the routine work, which is the majority of the work, they get there for a fraction of the time and cost. That is the trade you are buying when you buy a distilled model.
You will encounter at least three variants of this in the wild: soft-target distillation, where the student copies the teacher’s full set of probability scores rather than only its top answer; step-by-step distillation, where the teacher produces both an answer and its reasoning and the student learns to do both; and self-distillation, where shallow layers of one model learn from deeper layers of the same model during training and the deeper layers are then thrown away for deployment. The naming is technical; the underlying move is the same.
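Soft-target distillation is the easiest of the three to show in a few lines. The sketch below uses one common formulation, a temperature-softened KL divergence between the teacher's and the student's scores; the temperature value and the example scores are assumptions chosen for illustration, and real training code usually adds further terms and scaling.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL divergence between the teacher's and student's softened probabilities."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative scores over three possible answers.
teacher = [4.0, 3.5, 0.1]   # teacher thinks answer 0 and answer 1 are both plausible
student_a = [4.0, 3.4, 0.0]  # same top answer, similar spread
student_b = [4.0, 0.0, 0.0]  # same top answer, but ignores the runner-up

loss_a = soft_target_loss(teacher, student_a)
loss_b = soft_target_loss(teacher, student_b)
print(loss_a < loss_b)
```

Both students pick the same top answer, but the second is penalised more heavily because it has not learned how close the teacher thought the runner-up was. That extra signal is what the student gains from copying probability scores.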
Why does distillation matter for your business?
It matters because it is the mechanism behind almost every “cheaper version of the same thing” that your AI vendor is selling you. The big models are expensive to run; the distilled versions run on commodity hardware at a fraction of the cost. SabrePC’s vendor-side analysis lists the practical benefits: smaller model size, faster inference, lower latency, reduced cloud and hardware spend, and the ability to deploy on resource-constrained devices.
For an owner-operator, three things follow. Your per-query cost on routine tasks can drop by an order of magnitude if you move from a flagship to a distilled variant. You get the option, depending on the model and the licence, to run the smaller model on your own infrastructure rather than sending every query to a hyperscaler. And you get a cleaner data-protection story, because once your distilled model is trained you can stop sending live customer data to the teacher for labelling.
The flip side is real. A distilled model is not the same as the teacher. On routine work the gap is small; on the harder cases the gap shows up. The trade only pays back when you have enough volume of routine work to justify the cost of training and maintaining the student in the first place. For an SME doing fifty queries a week, the trade does not pay back. For one doing fifty thousand, it usually does.
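The volume argument is simple arithmetic, and a rough sketch makes it concrete. The per-query prices are the ten pence and one penny from the opening example; the £20,000 set-up cost is a purely hypothetical figure, and the sketch ignores ongoing maintenance, so treat the output as an illustration rather than a forecast.

```python
def weeks_to_break_even(
    queries_per_week: int,
    flagship_cost: float,
    distilled_cost: float,
    setup_cost: float,
) -> float:
    """Weeks until per-query savings cover the one-off cost of the student."""
    weekly_saving = queries_per_week * (flagship_cost - distilled_cost)
    if weekly_saving <= 0:
        return float("inf")  # no saving, so the cost is never recovered
    return setup_cost / weekly_saving


SETUP_COST = 20_000.0  # hypothetical one-off training cost, in pounds

small_firm = weeks_to_break_even(50, 0.10, 0.01, SETUP_COST)
large_firm = weeks_to_break_even(50_000, 0.10, 0.01, SETUP_COST)

print(f"50 queries a week: {small_firm:,.0f} weeks to break even")
print(f"50,000 queries a week: {large_firm:.1f} weeks to break even")
```

Under these assumptions the small firm would need several thousand weeks to recover the set-up cost, while the larger one gets there in under five. Change the set-up figure and the weeks move, but the thousandfold gap between the two firms stays the same.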
Where will you actually meet it?
You will rarely see “distillation” on a sales slide. The vendor language is different: “Turbo” or “Lite” variants, “small” or “domain-specific” models trained from a foundation model, “edge-ready” or “on-device” tooling, and “fine-tuned for your task” packages where the underlying base is a distilled student. These are usually distillation under another name. OpenAI has confirmed its Turbo variants are distilled from larger models; much of the market has settled into similar language.
You will also meet it in the build-versus-buy conversation with any consultancy proposing a custom AI tool. The economics often only work if a large hosted model is used as a teacher to label your historic data, then a smaller open-source or in-house student takes over the live workload. If a vendor is quoting you a price that seems unusually low for a custom-trained model, the explanation is almost always distillation plus a strong base. Worth knowing so you can ask the right questions about accuracy, licence, and what happens when the teacher model is upgraded.
A third place is regulator-facing documentation. The EU AI Act now imposes specific obligations on general-purpose AI models and on systems built from them, and the CMA in the UK is watching foundation-model concentration closely. If your vendor’s small model is derived from a GPAI provider, those obligations can flow through to you as the deployer. The word may not appear in your contract, but the supply chain still runs through it.
When to ask vs when to ignore
Ask hard when you are buying anything that will process customer data at volume, when you are weighing an on-premise deployment, when a quote looks too cheap to make sense, or when you have a regulated activity where model risk governance already applies. In those cases the distinction between teacher and student changes your data-flow map, your contractual exposure, and what you put in your DPIA.
The questions are not technical. Is this model distilled, and from what teacher? What accuracy on our use-case have you measured against the teacher? Can we run it on our infrastructure or only yours? What happens when the teacher is upgraded by the upstream provider? Are there licence terms that limit what we can do with the outputs? Five questions, all of which a competent vendor can answer in a meeting.
Ignore the question when the volume is low, the workload is varied and ad-hoc, or the value of any single query is high enough that paying the flagship rate per call is the rational choice. A small firm running a few dozen AI-drafted contracts a month does not need to know whether the model is distilled. The training and maintenance overhead of a custom student does not pay back at that volume, and a well-validated flagship is the cleaner choice. Distillation is a high-volume tool. The reasons to care scale with how much routine AI work your business is doing.
Related concepts worth knowing
Fine-tuning is the close cousin and the one people most often confuse with distillation. Fine-tuning takes a base model and trains it further on your data so it gets better at your specific task. Distillation takes a big model’s behaviour and compresses it into a smaller body. The two are often used together, for example a distilled student that is then fine-tuned on your labelled data, but they are not the same thing.
Quantisation is the other compression technique you will hear about. Where distillation creates a smaller architecture, quantisation keeps the architecture and reduces the precision of the numbers inside it, which speeds it up and shrinks its memory footprint. A model can be both distilled and quantised; many edge deployments are.
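To make "reducing the precision of the numbers" concrete, here is a minimal sketch of one simple scheme, symmetric 8-bit quantisation of a handful of weights. The weights are invented for illustration, and production toolchains use more elaborate variants; the sketch only shows the basic idea of swapping high-precision decimals for small whole numbers plus a single scale factor.

```python
from typing import List, Tuple


def quantise_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map each weight to a whole number in [-127, 127] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantise(values: List[int], scale: float) -> List[float]:
    """Recover approximate weights from the stored whole numbers."""
    return [v * scale for v in values]


weights = [0.5, -1.27, 0.003, 1.0]  # illustrative values only
stored, scale = quantise_int8(weights)
restored = dequantise(stored, scale)

print(stored)
print([round(r, 3) for r in restored])
```

Notice that the smallest weight, 0.003, is rounded away to zero. That small loss of detail is the price of quantisation, in the same way that the gap on hard cases is the price of distillation.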
The piece worth holding in your head is that “small model” covers several different decisions. Distillation, quantisation, fine-tuning, and base model selection are all separate choices made by your vendor or your in-house team. When the result is good you should know which of those choices got you there. When it is poor you should know which of them to revisit. Asking the question is what separates an informed buyer from one who is hoping the cheap option works.
If you want to think this through against your own use-case rather than in the abstract, book a conversation. The right answer depends on volume, data sensitivity, and the regulatory frame you sit inside; an hour is usually enough to know whether distillation is something you should be paying attention to or politely ignoring.



