What is model quantisation and how does it affect AI deployment costs?

TL;DR

Model quantisation reduces the bit-width of an AI model's stored numbers, making it smaller and cheaper to run. For owner-managed businesses, it matters if you self-host open-source models, run high-volume internal tools, or manage your own GPU infrastructure. If you rely on SaaS AI features or modest API calls to hosted providers, the vendors handle all of this on their side.

Key takeaways

- Model quantisation re-encodes a model's internal numbers from 32-bit or 16-bit precision to 8-bit or 4-bit integers, making the model file smaller and inference faster.
- Moving from 32-bit to 8-bit precision makes a model about four times smaller. The ngrok engineering team reports roughly a twofold speed-up and a five to ten per cent drop in benchmark accuracy for well-implemented low-bit quantisation on many language tasks.
- Quantisation matters primarily if you self-host open-source models or build high-volume internal AI tools on your own infrastructure; for SaaS AI and modest API volumes, vendors handle it.
- UK and EU regulations treat quantisation as a model change: ICO guidance requires documentation of how technical decisions affect accuracy and fairness, and NCSC guidance treats model files as software assets requiring access controls.
- The practical starting point is to pick well-maintained 8-bit or quality 4-bit model variants from a reputable source, run fifty real business prompts through them, and compare outputs against a full-precision reference before committing.

A founder running a small legal firm recently looked at the bill for a cloud AI service they’d been running internally for six months. The costs were climbing faster than the value. Their AI consultant mentioned three options: switch providers, cut usage, or look at running a lighter version of the model. That third option came with a word the founder hadn’t heard before: quantisation.

Many owners at the API-only stage never need to know this word. But if you’re self-hosting a model, building a high-volume internal tool, or trying to contain cloud compute costs, quantisation may be a lever you haven’t pulled yet.

What is model quantisation?

Quantisation means storing a model’s numbers with fewer bits. A standard AI model stores its internal values as 32-bit or 16-bit floating-point numbers. Quantisation re-encodes those values as 8-bit or 4-bit integers. The model file shrinks, each calculation costs less to run, and the model still produces usable output for many everyday tasks.

Think of it as a trade-off between precision and resource use. When you train an AI model, it builds a vast table of numbers called weights. These weights represent what the model has learned. At full precision, storing each weight takes 32 bits of memory. Move to 8 bits and you’re using a quarter of the storage for each number.
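The storage arithmetic above can be sketched in a few lines of Python. This is a rough illustration only: the 7-billion-parameter figure is a hypothetical example, not a model named in this article, and real quantised files are usually a little larger than this estimate because they also store scaling information and may keep some layers at higher precision.

```python
def model_size_gb(num_parameters: int, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10^9 bytes)."""
    total_bytes = num_parameters * bits_per_weight / 8
    return total_bytes / 1e9


# Hypothetical model size, chosen only to make the arithmetic concrete.
PARAMS = 7_000_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(PARAMS, bits):.1f} GB")
# 32-bit: 28.0 GB
# 16-bit: 14.0 GB
#  8-bit: 7.0 GB
#  4-bit: 3.5 GB
```

The halving at each step is why a model that needs a multi-GPU server at full precision can sometimes fit on a single card once quantised.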

IBM describes quantisation as reducing the precision of digital values to cut compute and memory use. Symbl.ai notes that going from 32-bit to 8-bit or 4-bit “significantly decreases” model size and lets models run on a single GPU or even a standard CPU. The ngrok engineering team measured the practical result: models can run roughly four times smaller and twice as fast, with around a five to ten per cent drop in accuracy for many language tasks.

The precision loss is real, but for everyday tasks like summarising case notes, answering policy questions, or classifying customer emails, a small trade-off is often acceptable.
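To see where that precision loss comes from, here is a minimal sketch of one simple scheme: symmetric 8-bit quantisation with a single shared scale. The weights are made-up values, and production toolchains use more sophisticated methods (per-channel or per-block scales, calibration data), so treat this as an illustration of the idea, not of any particular library.

```python
def quantise_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [round(v / scale) for v in values]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantised]


weights = [0.82, -0.31, 0.05, -1.27, 0.64]  # made-up example weights
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# Rounding to the nearest integer step means each restored weight can be
# off by at most half a step (scale / 2).
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("stored integers:", q)
print("largest rounding error:", max_error)
```

Each weight now occupies 8 bits instead of 32, and the price is a small rounding error per weight. Whether those errors add up to a noticeable quality drop depends on the model and the task, which is why testing on your own prompts matters.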

Why does this affect what you pay to run AI?

Running an AI model costs money in two ways: the memory needed to hold the model on a chip, and the computation needed to generate each answer. A quantised model needs less of both. For high-volume internal tools or any setup where you’re paying by GPU hour, the difference between running a full-precision model and a quantised one can be significant.

NVIDIA describes quantisation as a core technique to reduce memory bandwidth and latency on GPUs, where memory access is a consistent bottleneck. CAST AI’s analysis of production LLM deployments shows that quantisation, combined with other optimisation steps, can materially reduce cloud GPU spend for always-on endpoints.

For a five to fifty person owner-managed business, the business case works like this. If you’re self-hosting a model for a high-volume repeat task, such as summarising hundreds of client calls a day or running document searches across a shared drive, a quantised model running on a mid-range GPU workstation may cost less per month than renting a cloud multi-GPU instance. Meegle’s guide to AI infrastructure notes that quantisation lets businesses deploy AI on edge devices and modest hardware, cutting costs for smaller operators.
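A back-of-envelope version of that comparison might look like the sketch below. Every figure in it is a placeholder assumption (hourly rate, hardware price, lifetime, power draw, electricity price), not a quote or benchmark, and it ignores staff time, maintenance, and backup. Swap in your own numbers before drawing any conclusion.

```python
HOURS_PER_MONTH = 730  # average hours in a month for an always-on service


def monthly_cloud_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of renting a cloud GPU instance that runs continuously."""
    return hourly_rate * hours


def monthly_workstation_cost(
    purchase_price: float,
    lifetime_months: int,
    power_kw: float,
    price_per_kwh: float,
    hours: float = HOURS_PER_MONTH,
) -> float:
    """Straight-line depreciation plus electricity for an owned machine."""
    depreciation = purchase_price / lifetime_months
    electricity = power_kw * hours * price_per_kwh
    return depreciation + electricity


# All values below are hypothetical placeholders.
cloud = monthly_cloud_cost(hourly_rate=4.00)
local = monthly_workstation_cost(
    purchase_price=6000, lifetime_months=36, power_kw=0.5, price_per_kwh=0.30
)
print(f"cloud multi-GPU instance: {cloud:.2f} per month")
print(f"owned GPU workstation:    {local:.2f} per month")
```

The comparison only holds if the quantised model genuinely fits on the cheaper hardware and its output quality is acceptable for the task.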

If you’re paying for cloud AI through an API, the quantisation decisions are made inside the provider’s infrastructure. Your levers are model selection and prompt design, not bit-width.

Where will you actually encounter quantisation?

Quantisation shows up most visibly in the open-source model ecosystem. If you or a technical hire download a language model from Hugging Face to run locally, the format you choose often encodes the bit-width. GGUF files, one of the common formats for running models on a laptop or a local server, are built around quantised weights.
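In practice, the bit-width often appears in the file name itself. Community-published GGUF files commonly carry labels such as `Q4_K_M` or `Q8_0`, where the digit is the nominal bit-width. The helper below is a hypothetical sketch that reads that label. The file names are invented, the naming pattern is a convention and not a guarantee, and the effective average bits per weight can differ from the nominal figure.

```python
import re
from typing import Optional

# Matches labels like ".Q4_K_M.gguf" or "-Q8_0.gguf" at the end of a name.
_QUANT_LABEL = re.compile(r"[.\-]Q(\d)_[A-Z0-9_]+\.gguf$", re.IGNORECASE)


def nominal_bits_from_gguf_name(filename: str) -> Optional[int]:
    """Return the nominal bit-width hinted at by the file name, if any."""
    match = _QUANT_LABEL.search(filename)
    return int(match.group(1)) if match else None


# Invented file names for illustration only.
for name in (
    "example-model-7b.Q4_K_M.gguf",
    "example-model-7b.Q8_0.gguf",
    "example-model-7b.gguf",
):
    print(name, "->", nominal_bits_from_gguf_name(name))
```

When the name carries no label, check the model card on the publisher's page instead of guessing.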

Meta’s LLaMA models are published with quantised variants, and the community has built an extensive range of quantised versions of Mistral, Phi-3, and other open-source models. NVIDIA’s TensorRT-LLM stack provides automated quantisation pipelines to reduce inference latency and cost. Hugging Face hosts thousands of quantised models in GGUF, GPTQ, and AWQ formats with straightforward documentation for loading them.

If a technical team or AI consultant is building an internal tool using an open-source model, selecting a quantised variant is a routine step. For a UK services firm where client data stays on-premises for FCA, contractual, or data residency reasons, this is where quantisation becomes a practical question rather than an abstract one.

A field services business deploying an AI assistant in a mobile app for offline note-taking is another context where it appears. Quantisation lets a capable model fit inside a smartphone or tablet without requiring a live cloud connection.

When does quantisation matter for your business, and when should you ignore it?

The answer depends on whether you self-host models. If you rely entirely on SaaS AI features in Microsoft 365, Salesforce, or similar platforms, or if you make modest API calls to hosted providers, quantisation is a decision for the provider’s engineers, not yours. If you run models yourself, it becomes directly relevant.

You’re more likely to benefit from understanding quantisation if you’re building a repeat-use internal assistant on open-source models, if you have data residency or confidentiality requirements that push you toward on-premises deployment, or if you want predictable costs for a high-volume service rather than a usage-based bill.

You can safely deprioritise it if your AI use is still at the experiment stage, if your team is relying on tools like ChatGPT or Microsoft Copilot, or if your main constraints are data quality, workflow design, or staff adoption rather than compute cost.

One counterpoint worth naming: for tasks where accuracy is critical, such as complex regulatory analysis or financial modelling, even a five per cent quality drop from aggressive quantisation may be unacceptable. In those cases, a full-precision hosted model may be the better fit, even at higher cost.

Governance matters here too. The UK ICO’s AI guidance makes clear that AI systems processing personal data must be accurate, fair, and secure, and organisations must document how technical decisions affect those properties. A switch from a full-precision model to an aggressively quantised one is a model change that should be logged, tested against your actual tasks, and noted in your data protection impact assessment. NCSC guidance treats model files as software assets requiring access control and change management.

For regulated financial services firms, the Bank of England’s model risk management principles and FCA expectations mean that any quantisation change affecting a model used in decision-making needs validation and sign-off, not just a deployment update.

What else should you know alongside quantisation?

Quantisation sits alongside several other techniques for making AI cheaper and faster to run. Understanding the vocabulary helps you ask better questions of any technical hire or AI vendor. Knowledge distillation, pruning, and ONNX conversion all aim at similar goals: smaller, faster, cheaper AI without losing too much of what makes the model useful.

Knowledge distillation trains a smaller model to mimic a larger one. Pruning removes the weights that contribute least to the model’s output, reducing model size without re-encoding numbers. ONNX (Open Neural Network Exchange) is a format standard that makes models more portable across hardware environments and often enables additional optimisation at deployment time.
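Pruning is the easiest of these to show in miniature. The sketch below uses weight magnitude as a stand-in for "contributes least", which is a common heuristic but an assumption here; real pruning tools may use other criteria and usually retrain the model afterwards. The weights are made-up values.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    n_prune = int(len(weights) * fraction)
    # Rank positions from smallest to largest absolute weight.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


weights = [0.82, -0.31, 0.05, -1.27, 0.64, 0.02]  # made-up example weights
pruned = prune_by_magnitude(weights, 0.5)
print(pruned)  # [0.82, 0.0, 0.0, -1.27, 0.64, 0.0]
```

Note the contrast with quantisation: pruning keeps the surviving numbers at their original precision and drops the rest, whereas quantisation keeps every number and stores each one more coarsely.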

Two main approaches to quantisation itself are worth knowing. Post-training quantisation is applied after the training process is complete; it’s quicker to apply but can reduce accuracy more at lower bit-widths. Quantisation-aware training bakes quantisation into the training process, which typically produces better accuracy at the cost of more engineering effort.

PyTorch, TensorFlow, and NVIDIA TensorRT all ship quantisation toolchains. For a business owner, the practical point is to start with already-quantised models from reputable publishers, run a small evaluation set of perhaps twenty to fifty real prompts from your business, and compare outputs against a full-precision reference. That comparison tells you whether the accuracy trade-off is acceptable for your specific tasks.
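That comparison can start as a very small script. The sketch below is one possible shape for it, under stated assumptions: `reference_stub` and `quantised_stub` are hypothetical stand-ins for calls to your real full-precision and quantised models, the 0.8 threshold is arbitrary, and character-level text similarity is only a crude first filter. A person should still read the flagged outputs.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def compare_models(
    prompts: List[str],
    reference_model: Callable[[str], str],
    quantised_model: Callable[[str], str],
    threshold: float = 0.8,
) -> List[Tuple[str, float, bool]]:
    """Score how closely the quantised output matches the reference per prompt."""
    results = []
    for prompt in prompts:
        reference_output = reference_model(prompt)
        quantised_output = quantised_model(prompt)
        similarity = SequenceMatcher(None, reference_output, quantised_output).ratio()
        results.append((prompt, similarity, similarity >= threshold))
    return results


# Hypothetical stand-ins: replace with calls to your actual models.
def reference_stub(prompt: str) -> str:
    return f"Summary: {prompt.lower()}"


def quantised_stub(prompt: str) -> str:
    # Pretend the quantised model drifts on longer prompts.
    return f"Summary: {prompt.lower()}" if len(prompt) < 40 else "Unable to summarise."


prompts = [
    "Summarise this client call",
    "Classify this customer email about a late delivery and a refund",
]
results = compare_models(prompts, reference_stub, quantised_stub)
for prompt, similarity, ok in results:
    status = "ok" if ok else "REVIEW"
    print(f"{status:6} {similarity:.2f}  {prompt}")
```

With twenty to fifty real prompts in the list, the rows marked for review give you a short, concrete set of cases to inspect before deciding whether the quantised variant is good enough.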

If you need to discuss AI deployment costs with a vendor or technical consultant, asking whether quantised models are in scope and what bit-width they recommend for your task type is a straightforward, informed question.

Sources

- IBM (2024). "Model Quantization." IBM Think. Defines quantisation as reducing the precision of digital values (FP32 to INT8) to cut compute and memory use, with a plain-English explanation of how weights are re-encoded. https://www.ibm.com/think/topics/quantization
- ngrok engineering team (October 2023). "Quantization." Reports empirical findings: 4x size reduction and roughly 2x speed-up with a 5 to 10 per cent accuracy loss for well-implemented low-bit LLM quantisation. https://ngrok.com/blog/quantization
- Symbl.ai (2024). "A Guide to Quantization in LLMs." Explains how INT8 and INT4 reduce model memory and compute, enabling deployment on a single GPU or CPU, with a structured overview of PTQ and QAT approaches. https://symbl.ai/developers/blog/a-guide-to-quantization-in-llms/
- NVIDIA (May 2023). "Model Quantization: Concepts, Methods, and Why It Matters." Documents INT8 quantisation frameworks for production GPU deployments, with guidance on calibration data and activation handling. https://developer.nvidia.com/blog/model-quantization-concepts-methods-and-why-it-matters/
- CAST AI (2024). "Demystifying Quantization in LLMs." Explains how quantisation combined with other optimisation steps can materially reduce cloud GPU spend for always-on LLM endpoints. https://cast.ai/blog/demystifying-quantizations-llms/
- ICO (December 2023). "ICO updates guidance on artificial intelligence and data protection." Clarifies that AI systems must be accurate, fair, and secure; organisations must document how technical decisions, including model changes, affect those properties. https://ico.org.uk/about-the-ico/media-centre/news-and-blogs/2023/12/ico-updates-guidance-on-artificial-intelligence-and-data-protection/
- NCSC. "Machine Learning." NCSC guidance that model artefacts, including quantised models, must be treated as sensitive software assets requiring access controls, patching, and change management. https://www.ncsc.gov.uk/collection/machine-learning
- FCA. "AI in financial services." Sets out FCA expectations for governance, validation, and monitoring of models used in decision-making, applicable when quantisation changes a model used in regulated contexts. https://www.fca.org.uk/news/speeches/ai-financial-services
- Bank of England / PRA (July 2023). "Model Risk Management Principles for Banks (PS6/23)." Sets out model risk governance standards including validation and sign-off requirements for model changes. https://www.bankofengland.co.uk/prudential-regulation/publication/2023/july/model-risk-management-principles-for-banks-ps6-23
- EU AI Act (2024). Regulation (EU) 2024/1689. Imposes risk-management and technical-documentation obligations on providers and deployers of AI systems, covering model-level changes such as quantisation for high-risk use cases. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689

Frequently asked questions

What is the difference between 8-bit and 4-bit quantisation for language models?

Eight-bit quantisation (INT8) typically preserves accuracy very close to full 16-bit precision for everyday language tasks. Four-bit quantisation goes further, shrinking the model more and running faster, but requires careful implementation to avoid noticeable quality loss. The ngrok engineering team measured around a five to ten per cent accuracy reduction on benchmarks for well-implemented 4-bit methods. Start with 8-bit or high-quality 4-bit variants from established sources and test against your real prompts before committing.

Does model quantisation affect AI accuracy for tasks like summarising documents or answering queries?

For many everyday business tasks, the accuracy trade-off from 8-bit quantisation is small enough to be acceptable. Summarising case notes, classifying emails, and answering policy questions tend to hold up well with a modest quality reduction. Tasks requiring fine-grained reasoning, precise numerical analysis, or high-stakes decisions may be more sensitive. The practical approach is to run fifty real examples from your business through both the full-precision model and the quantised version, then compare outputs before deploying.

If I only use ChatGPT or Microsoft Copilot, do I need to think about model quantisation at all?

No. When you use hosted AI services like ChatGPT, Microsoft Copilot, or Google Gemini via their standard interfaces or APIs, the providers manage all infrastructure decisions, including quantisation. You pay for output, not for compute you control. Quantisation becomes a decision you own only when you're self-hosting an open-source model, building your own inference server, or managing GPU infrastructure directly. That is when it is directly relevant to your costs and deployment options.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

