What is model quantization? Why it matters for your business

TL;DR

Quantization is a compression technique that shrinks an AI model by reducing the precision of the numbers inside it. It cuts memory requirements by roughly a factor of four and lets you run a model that would otherwise need premium GPUs on much cheaper hardware. The trade-off is a small accuracy loss, usually 1% to 3%. It matters when you are running AI on your own infrastructure or at the edge. It does not matter when you are buying a cloud API.

Key takeaways

- Quantization shrinks an AI model by storing its numbers in lower precision, typically 4-bit or 8-bit instead of 32-bit.
- The effect is roughly 4x smaller, 2-3x faster, with 1-3% accuracy loss on most general benchmarks.
- Q4 (four-bit) has become the standard sweet spot for self-hosted and edge deployment in 2026.
- It matters when you are self-hosting or running AI at the edge. Cloud API users do not need to think about it.
- "Q4" or "GGUF" or "GPTQ" in a vendor pitch is the technical signal that quantization is in play.

An IT lead at a manufacturing client called me about a vendor proposal. The vendor wanted to install a £45,000 GPU rig in his comms cupboard. Two pages later the same proposal mentioned a “Q4 quantized” version of the model that would run on a single £5,000 GPU. He asked which one he actually needed. The second number was the real one.

By 2026 quantization is what makes self-hosted and edge AI affordable, and it is the term that decides whether on-premise AI is sensible or a vanity hardware purchase. Worth knowing if anyone is proposing to put AI inside your building.

What is quantization?

Quantization is a compression technique that shrinks an AI model by storing the numbers inside it in lower precision. A model trained in 32-bit floating-point (called FP32) holds each internal parameter as a high-precision number. Quantization converts those numbers to lower-precision integers, typically 8-bit (Q8) or 4-bit (Q4). A 32-bit number becomes a 4-bit number. The model gets roughly four times smaller, runs two to three times faster, and uses much less memory.
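
To see the mechanics, here is a toy sketch of symmetric four-bit quantization in Python: a block of FP32 weights is rounded to integers between -8 and 7 with one shared scale. It illustrates the principle only, not any vendor's production method.

```python
import numpy as np

# Toy illustration of the core idea, not any vendor's production method.
# Take a block of FP32 "weights" and store them as 4-bit integers
# plus a single FP32 scale for the block.
rng = np.random.default_rng(0)
weights_fp32 = rng.normal(0, 0.02, size=256).astype(np.float32)

# Symmetric quantization: map the weight range onto integers -8..7.
scale = np.abs(weights_fp32).max() / 7
q4 = np.clip(np.round(weights_fp32 / scale), -8, 7).astype(np.int8)

# At inference time the integers are scaled back up to approximate
# the original weights. The difference is the rounding error the
# network has to tolerate.
restored = q4.astype(np.float32) * scale
print("max absolute error:", np.abs(weights_fp32 - restored).max())
```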

The reason it works is that neural networks have built-in tolerance for small numerical noise. They are trained to recognise patterns and ignore minor perturbations, which means they cope with the rounding error that quantization introduces. The accuracy loss on standard benchmarks is small. Research published in 2024 and 2025 shows four-bit quantization typically retains 97% to 99% of the full-precision model’s performance, and eight-bit retains essentially all of it.

The trade-off is not even across tasks. Quantization hurts most where small errors compound: long chains of mathematical reasoning, code generation, multi-step analytical problems. It hurts least on classification, summarisation and dialogue. For a UK service business deploying an AI tool for customer Q&A or internal document search, the loss is usually invisible. For a financial-modelling tool, it is not.

You will see three named methods in vendor pitches. GPTQ is a post-training quantization technique optimised for GPU inference. AWQ preserves the most important weights more carefully, at slightly higher quality. GGUF is a binary format designed to run quantized models on CPUs or Apple Silicon, useful for offline and laptop deployment. The choice between them affects which hardware you need.
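
If GGUF appears on a spec sheet, the deployment usually looks something like the sketch below, which uses the llama-cpp-python bindings. The model filename and settings are placeholders for illustration, not a recommendation.

```python
# A minimal sketch of local inference on a Q4 GGUF model using the
# llama-cpp-python bindings. The filename is a placeholder: quantized
# GGUF files are usually downloaded ready-made from a model hub.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.1-8b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,    # context window
    n_threads=8,   # CPU threads; GGUF targets CPUs and Apple Silicon
)

out = llm("Summarise our returns policy in two sentences:", max_tokens=128)
print(out["choices"][0]["text"])
```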

Why it matters for your business

The first reason is hardware cost. Running a 70-billion-parameter model in full precision typically needs two or more premium GPUs costing £25,000 or more, plus power, cooling and rack space. The same model quantized to four-bit fits on a single mid-range GPU costing roughly £5,000. For an SME weighing on-premise AI, that one decision separates a serious capital purchase from a sensible IT line item.
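
The arithmetic is simple enough to check yourself. The sketch below counts weight storage only, ignoring activation memory and the per-block scale overhead a real deployment adds, so treat the figures as rough floors.

```python
# Back-of-envelope memory maths for a 70-billion-parameter model.
# Ignores activation memory and per-block scale overhead, so treat
# these as rough floors, not exact requirements.
params = 70e9

bytes_fp32 = params * 4.0   # 32-bit floats: 4 bytes per weight
bytes_q4 = params * 0.5     # 4-bit integers: half a byte per weight

print(f"FP32 weights: ~{bytes_fp32 / 1e9:.0f} GB")  # ~280 GB: multiple premium GPUs
print(f"Q4 weights:   ~{bytes_q4 / 1e9:.0f} GB")    # ~35 GB: one mid-range GPU
```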

The second reason is operational cost. Cloud APIs charge per token of input and output. For a business processing tens of millions of tokens a month, the cumulative bill becomes meaningful. A quantized model on your own hardware replaces that variable cost with a fixed one. The cross-over point is usually above 50 million tokens a month.
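
The break-even point depends on your numbers, but the shape of the calculation is always the same. Every price in the sketch below is an assumption for the example; substitute your actual API rate, hardware quote and running costs.

```python
# Illustrative break-even sketch. Every price here is an assumption
# for the example, not a quote: substitute your own API rate,
# hardware cost and running costs.
api_price_per_m_tokens = 3.00   # £ per million tokens, blended (assumed)
hardware_cost = 5_000.0         # £, mid-range GPU (assumed)
running_cost_monthly = 150.0    # £, power and hosting (assumed)
amortisation_months = 36

fixed_monthly = hardware_cost / amortisation_months + running_cost_monthly

for tokens_m in (10, 50, 100, 200):  # millions of tokens per month
    api_monthly = tokens_m * api_price_per_m_tokens
    winner = "self-host" if fixed_monthly < api_monthly else "cloud API"
    print(f"{tokens_m:>3}M tokens/month: API £{api_monthly:,.0f} "
          f"vs fixed £{fixed_monthly:,.0f} -> {winner}")
```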

The third reason is data residency. Cloud APIs send your input data to the provider’s infrastructure. The ICO’s 2026 guidance on international transfers requires a lawful basis and contractual safeguards for transfers outside the UK. Local inference on quantized models keeps the data inside your boundary and removes the audit work that comes with international transfers. The NCSC’s 2026 guidance on edge devices treats local AI as compatible with responsible adoption, provided the device is properly secured.

The fourth reason is latency. Cloud APIs add 200 to 2,000 milliseconds of network round-trip per call. Local inference on a quantized model can respond in under 100 milliseconds. For real-time customer interactions or industrial sensor processing, that difference is what makes the use case work.

Where you will meet it

You will meet quantization, often without the word, in any pitch that promises “runs on your device”, “edge AI”, “fully on-premise” or “no data leaves your network”. All of those claims depend on quantization underneath. A full-precision model is too large to run sensibly on a laptop, a single GPU or an edge box. The vendor either knows quantization is in play and is just sparing you the term, or they are over-promising on hardware.

You will also meet it in the Q-notation. Q4, Q5, Q8 are precision levels. Q4 is the 2026 default for cost-optimised deployment. Q8 is the choice when you have GPU headroom and want minimal accuracy loss. Q2 and Q3 are aggressive compression rarely sensible for production. If the vendor’s spec sheet uses these labels without explanation, ask which level and why.
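
As a sanity check on a vendor's spec sheet, the rough relationship between Q-level and model size works out like this. The bits-per-weight figures are approximate averages, since real formats such as llama.cpp's Q4_K_M carry per-block scales that add a fraction of a bit.

```python
# Approximate weight storage per precision level for a 70B model.
# Bits-per-weight values are rough averages: real formats such as
# llama.cpp's Q4_K_M carry per-block scales that add a fraction of a bit.
PARAMS = 70e9
levels = {"Q2": 2.6, "Q3": 3.9, "Q4": 4.5, "Q5": 5.7, "Q8": 8.5, "FP16": 16.0, "FP32": 32.0}

for name, bits in levels.items():
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:>4}: ~{gb:4.0f} GB")
```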

You will meet it in the named methods too. GGUF in the spec sheet is a strong signal that the vendor expects you to run inference on a CPU or an Apple Silicon machine, often a laptop or an office workstation. GPTQ or AWQ signals GPU deployment. The shape of your hardware purchase follows from this choice and it is worth getting clear before the invoice arrives.

The most useful place to meet the term is in the comparison conversation. A vendor proposing on-premise AI should be able to show you the per-query cost on quantized hardware versus the equivalent cloud API at your expected volume. If they cannot, they have not done the maths and you are buying capacity blind.

When to ask about it, when to ignore it

Ask about quantization when you are considering self-hosting any AI capability, or deploying at the edge, or running AI in a regulated environment where data must stay local. In all three cases the question to put to the vendor is “what precision is the deployment, what method, and what is the validated accuracy on tasks like mine?” Then “show me the hardware spec and the cost-per-query maths versus the equivalent cloud API.”

Ask about it harder when accuracy on your specific task matters more than speed or cost. Financial decision support, clinical triage, legal interpretation. The standard 1% to 3% accuracy loss at four-bit is usually fine for general tasks but can be material when each prediction has a real consequence. The FCA’s published approach to AI is explicit that firms must understand their models’ behaviour and explain decisions. A quantized model deserves the same validation as a full-precision one before you sign off.

Ignore the term when you are buying a cloud API like GPT, Claude or Gemini. The provider has already optimised their infrastructure, and quantization is not a knob you control. Focus on price per token, latency and quality. Asking the cloud vendor about their quantization approach is technically interesting and procurement-irrelevant.

Ignore it too when the use case is low-volume and exploratory. A team trialling AI for a couple of hours a week through ChatGPT does not need to think about model precision. Quantization becomes a procurement question when your volumes justify owning the infrastructure rather than renting it.

Terms it gets confused with

Distillation is a different way of making a model smaller. Quantization keeps the same architecture and lowers the precision of the numbers; distillation trains a smaller new model to imitate a larger one. Meta's Llama 3.2 1B and 3B models, for example, were distilled from larger Llama models. Distillation can compress further than quantization but needs expensive retraining.

Pruning is a third compression technique that removes the weights and neurons that contribute least. Like distillation it usually requires retraining. Research in 2025 and 2026 finds quantization beats pruning for everyday compression and pruning only catches up at extreme ratios.

LoRA (Low-Rank Adaptation) is a fine-tuning technique often confused with quantization. LoRA adapts a foundation model to your task by adding small trainable matrices on top. Quantization compresses the model. The two can be combined.

Inference optimisation is the umbrella term that covers quantization alongside batching, caching and hardware-specific compilation. When a vendor says “optimised for inference” they may be doing one or several of these. Worth asking which.

Edge AI is the deployment pattern quantization enables. AI running on a local server, a laptop or an in-store device rather than the cloud. The economics, the latency and the residency story all depend on the model being small enough to fit.

The honest test of any on-premise AI proposal is the per-query maths. The vendor who has done it will show you Q-level, hardware spec and cost-per-call without prompting. The vendor who has not is selling you headline hardware and asking you to trust the rest.

Frequently asked questions

Will a quantized model give worse answers than the full-precision version?

Slightly worse, usually within 1-3% on standard benchmarks at four-bit and almost imperceptibly worse at eight-bit. The loss is biggest in maths, code generation and long chains of reasoning, where small rounding errors compound. For customer service chatbots, document summarisation and search over your knowledge base, the difference is rarely noticeable.

Should I ask my SaaS vendor about quantization?

Probably not. If you are using GPT, Claude or Gemini through a cloud API, the vendor handles model optimisation behind the scenes and quantization is not a procurement question. It becomes one when you are deciding between a cloud API and self-hosting an open-weight model on your own hardware.

Does quantized AI help with UK data residency requirements?

Yes, indirectly. Quantization is what makes self-hosting on your own UK infrastructure economically viable. Running a 70-billion-parameter model in full precision needs premium GPU hardware costing £25,000 or more. The same model quantized to four-bit fits on a single mid-range GPU at a fifth of that price, which keeps inference local and keeps customer data inside your own boundary.

This post is general information and education only, not legal, regulatory, financial, or other professional advice. Regulations evolve, fee benchmarks shift, and every situation is different, so please take qualified professional advice before acting on anything you read here. See the Terms of Use for the full position.

Ready to talk it through?

Book a free 30-minute conversation. No pitch, no pressure, just a useful chat about where AI fits in your business.

Book a conversation
