A founder running a small legal firm recently looked at the bill for a cloud AI service they’d been running internally for six months. The costs were climbing faster than the value. Their AI consultant mentioned three options: switch providers, cut usage, or look at running a lighter version of the model. That third option came with a word the founder hadn’t heard before: quantisation.
Many owners at the API-only stage never need to know this word. But if you’re self-hosting a model, building a high-volume internal tool, or trying to contain cloud compute costs, quantisation may be a lever you haven’t pulled yet.
What is model quantisation?
Quantisation means storing a model’s numbers with fewer bits. A standard AI model stores its internal values as 32-bit or 16-bit floating-point numbers. Quantisation re-encodes those values as 8-bit or 4-bit integers. The model file shrinks, each calculation costs less to run, and the model still produces usable output for many everyday tasks.
Think of it as a trade-off between precision and resource use. When you train an AI model, it builds a vast table of numbers called weights. These weights represent what the model has learned. At full precision, storing each weight takes 32 bits of memory. Move to 8 bits and you’re using a quarter of the storage for each number.
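The storage arithmetic is simple enough to sketch in a few lines. The example below is a rough illustration rather than a sizing tool: the 7-billion-parameter figure is a hypothetical model size chosen for the example, and the calculation covers the weights only, ignoring file metadata and the working memory a model needs while it runs.

```python
def model_size_gb(num_parameters: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = num_parameters * bits_per_weight / 8
    return total_bytes / 1e9


# Hypothetical 7-billion-parameter model, chosen only for illustration.
PARAMS = 7_000_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(PARAMS, bits):.1f} GB")
# 32-bit: 28.0 GB
# 16-bit: 14.0 GB
#  8-bit: 7.0 GB
#  4-bit: 3.5 GB
```

Real quantised files tend to come out slightly larger than this estimate because they also store scaling factors alongside the integers.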
IBM describes quantisation as reducing the precision of digital values to cut compute and memory use. Symbl.ai notes that going from 32-bit to 8-bit or 4-bit “significantly decreases” model size and lets models run on a single GPU or even a standard CPU. The ngrok engineering team measured the practical result: quantised models can be roughly four times smaller and run about twice as fast, with an accuracy drop of around five to ten per cent on many language tasks.
The precision loss is real, but for everyday tasks like summarising case notes, answering policy questions, or classifying customer emails, a small trade-off is often acceptable.
Why does this affect what you pay to run AI?
Running an AI model costs money in two ways: the memory needed to hold the model on a chip, and the computation needed to generate each answer. A quantised model needs less of both. For high-volume internal tools or any setup where you’re paying by GPU hour, the difference between running a full-precision and a quantised model can be significant.
NVIDIA describes quantisation as a core technique to reduce memory bandwidth and latency on GPUs, where memory access is a consistent bottleneck. CAST AI’s analysis of production LLM deployments shows that quantisation, combined with other optimisation steps, can materially reduce cloud GPU spend for always-on endpoints.
For an owner-managed business of five to fifty people, the business case works like this. If you’re self-hosting a model for a high-volume repeat task, such as summarising hundreds of client calls a day or running document searches across a shared drive, a quantised model running on a mid-range GPU workstation may cost less per month than renting a cloud multi-GPU instance. Meegle’s guide to AI infrastructure notes that quantisation lets businesses deploy AI on edge devices and modest hardware, cutting costs for smaller operators.
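One way to test that comparison is a simple break-even calculation. The sketch below is illustrative only: the function name and every figure in it are hypothetical placeholders, not real prices, and you would substitute quotes from your own suppliers.

```python
import math
from typing import Optional


def months_to_break_even(
    hardware_cost: float,
    local_monthly_cost: float,
    cloud_monthly_cost: float,
) -> Optional[int]:
    """Months until buying local hardware costs less than renting cloud GPUs.

    Returns None if the local setup never saves money month to month.
    """
    monthly_saving = cloud_monthly_cost - local_monthly_cost
    if monthly_saving <= 0:
        return None
    return math.ceil(hardware_cost / monthly_saving)


# Placeholder figures, invented for the example (any currency):
# a 6,000 workstation, 150 a month for power and upkeep,
# against a 900 a month cloud GPU rental.
print(months_to_break_even(6000, 150, 900))  # 8
```

The calculation deliberately leaves out staff time for setup and maintenance, which can be the larger cost for a small firm and should be added to the local monthly figure.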
If you’re paying for cloud AI through an API, the quantisation decisions are made inside the provider’s infrastructure. Your levers are model selection and prompt design, not bit-width.
Where will you actually encounter quantisation?
Quantisation shows up most visibly in the open-source model ecosystem. If you or a technical hire download a language model from Hugging Face to run locally, the format you choose often encodes the bit-width. GGUF files, one of the common formats for running models on a laptop or a local server, are built around quantised weights.
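As an illustration, community GGUF files often carry a tag such as Q4_K_M or Q8_0 in the file name. The helper below is a hypothetical sketch that reads the nominal bit-width from such a name. It assumes that naming habit, which is a convention rather than a guarantee, and the file names shown are made up.

```python
import re
from typing import Optional


def bits_from_gguf_name(filename: str) -> Optional[int]:
    """Guess the nominal bit-width from a GGUF file name, or return None.

    Assumes tags like 'Q4_K_M', 'Q8_0' or 'F16' appear in the name.
    """
    # Integer-quantised tags such as Q4_K_M, Q8_0 or IQ4_XS.
    match = re.search(r"[.\-_]I?Q(\d)_", filename, flags=re.IGNORECASE)
    if match:
        return int(match.group(1))
    # Unquantised floating-point tags such as F16, BF16 or F32.
    match = re.search(r"[.\-_]B?F(16|32)[.\-_]", filename, flags=re.IGNORECASE)
    if match:
        return int(match.group(1))
    return None


# Made-up file names for illustration.
for name in ("example-7b.Q4_K_M.gguf", "example-7b.Q8_0.gguf", "example-7b.F16.gguf"):
    print(name, "->", bits_from_gguf_name(name))
```

The number in the tag is nominal. The effective bits per weight are usually a little higher, so the model card remains the authoritative source.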
Meta’s LLaMA models are published with quantised variants, and the community has built extensive quantised versions of Mistral, Phi-3, and other open-source models. NVIDIA’s TensorRT-LLM stack provides automated quantisation pipelines to reduce inference latency and cost. Hugging Face hosts thousands of quantised models in GGUF, GPTQ, and AWQ formats with straightforward documentation for loading them.
If a technical team or AI consultant is building an internal tool using an open-source model, selecting a quantised variant is a routine step. For a UK services firm where client data stays on-premises for FCA, contractual, or data residency reasons, this is where quantisation becomes a practical question rather than an abstract one.
A field services business deploying an AI assistant in a mobile app for offline note-taking is another context where it appears. Quantisation lets a capable model fit inside a smartphone or tablet without requiring a live cloud connection.
When does quantisation matter for your business, and when should you ignore it?
The answer depends on whether you self-host models. If you rely entirely on SaaS AI features in Microsoft 365, Salesforce, or similar platforms, or if you make modest API calls to hosted providers, quantisation is a decision for the provider’s engineers, not yours. If you run models yourself, it becomes directly relevant.
You’re more likely to benefit from understanding quantisation if you’re building a repeat-use internal assistant on open-source models, if you have data residency or confidentiality requirements that push you toward on-premises deployment, or if you want predictable costs for a high-volume service rather than a usage-based bill.
You can safely deprioritise it if your AI use is still at the experiment stage, if your team is relying on tools like ChatGPT or Microsoft Copilot, or if your main constraints are data quality, workflow design, or staff adoption rather than compute cost.
One counterpoint worth naming: for tasks where accuracy is critical, such as complex regulatory analysis or financial modelling, even a five per cent quality drop from aggressive quantisation may be unacceptable. In those cases, a full-precision hosted model may be the better fit, even at higher cost.
Governance matters here too. The UK ICO’s AI guidance makes clear that AI systems processing personal data must be accurate, fair, and secure, and organisations must document how technical decisions affect those properties. A switch from a full-precision model to an aggressively quantised one is a model change that should be logged, tested against your actual tasks, and noted in your data protection impact assessment. NCSC guidance treats model files as software assets requiring access control and change management.
For regulated financial services firms, the Bank of England’s model risk management principles and FCA expectations mean that any quantisation change affecting a model used in decision-making needs validation and sign-off, not just a deployment update.
What else should you know alongside quantisation?
Quantisation sits alongside several other techniques for making AI cheaper and faster to run. Understanding the vocabulary helps you ask better questions of any technical hire or AI vendor. Knowledge distillation, pruning, and ONNX conversion all aim at similar goals: smaller, faster, cheaper AI without losing too much of what makes the model useful.
Knowledge distillation trains a smaller model to mimic a larger one. Pruning removes the weights that contribute least to the model’s output, reducing model size without re-encoding numbers. ONNX (Open Neural Network Exchange) is a format standard that makes models more portable across hardware environments and often enables additional optimisation at deployment time.
Two main approaches to quantisation itself are worth knowing. Post-training quantisation is applied after the training process is complete; it’s quicker to apply but can reduce accuracy more at lower bit-widths. Quantisation-aware training bakes quantisation into the training process, which typically produces better accuracy at the cost of more engineering effort.
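To make the post-training case concrete, here is a minimal sketch of one simple scheme, symmetric per-tensor quantisation, applied to a handful of made-up weights. Production toolchains use more refined methods, such as per-channel or per-group scaling, so treat this as an illustration of the principle, not of any particular library.

```python
def quantise(weights, bits):
    """Symmetric per-tensor quantisation: map floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantise(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up weights, for illustration only.
weights = [0.82, -0.41, 0.05, -0.93, 0.27, 0.61, -0.12, 0.33]

for bits in (8, 4):
    q, scale = quantise(weights, bits)
    restored = dequantise(q, scale)
    print(f"{bits}-bit integers: {q}")
    print(f"  mean rounding error: {mean_abs_error(weights, restored):.4f}")
```

Running it shows the 4-bit version landing noticeably further from the original values than the 8-bit version, which is the accuracy cost described above in miniature.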
PyTorch, TensorFlow, and NVIDIA TensorRT all ship quantisation toolchains. For a business owner, the practical point is to start with already-quantised models from reputable publishers, run a small evaluation set of perhaps twenty to fifty real prompts from your business, and compare outputs against a full-precision reference. That comparison tells you whether the accuracy trade-off is acceptable for your specific tasks.
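A comparison like that can start very simply. The sketch below assumes you can call each model as a function that takes a prompt and returns text. The two model functions here are stand-ins invented for the example, and the text-similarity score is a crude first filter for spotting divergent answers, not a measure of correctness.

```python
from difflib import SequenceMatcher


def compare_models(prompts, reference_model, quantised_model, threshold=0.8):
    """Run each prompt through both models and flag low-similarity answers."""
    flagged = []
    for prompt in prompts:
        reference_answer = reference_model(prompt)
        quantised_answer = quantised_model(prompt)
        similarity = SequenceMatcher(None, reference_answer, quantised_answer).ratio()
        if similarity < threshold:
            flagged.append((prompt, similarity))
    return flagged


# Stand-in functions for illustration; replace with calls to your real models.
def reference_model(prompt):
    return f"Summary: {prompt.lower()}"


def quantised_model(prompt):
    if "indemnity" in prompt:
        return "Unable to summarise."
    return f"Summary: {prompt.lower()}"


prompts = [
    "Summarise the call with the client about invoicing",
    "Explain the indemnity clause in section 4",
]

flagged = compare_models(prompts, reference_model, quantised_model)
for prompt, score in flagged:
    print(f"Review by hand (similarity {score:.2f}): {prompt}")
```

Anything the script flags still needs a person who knows the work to judge whether the quantised answer is actually worse or merely worded differently.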
If you need to discuss AI deployment costs with a vendor or technical consultant, asking whether quantised models are in scope and what bit-width they recommend for your task type is a straightforward, informed question.


