Quantization

Technical Infrastructure

Reducing the precision of AI model weights (from 32-bit to 8-bit or 4-bit) to make models smaller and faster while sacrificing some quality.

Quantization compresses AI models by using less precise numbers. A model stored in full precision (FP32) uses 32 bits per parameter. Quantizing to INT8 uses 8 bits (4x smaller), and INT4 uses 4 bits (8x smaller). The model becomes smaller, loads faster, and runs with less VRAM.
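The memory arithmetic above is simple enough to sketch directly. This is a minimal illustration (the function name and parameter count are just for the example, not from any library):

```python
# Approximate weight storage for a model at different precisions.
# model_size_gb is an illustrative helper, not a real library function.

def model_size_gb(num_params: int, bits_per_param: int) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 70_000_000_000  # a 70B-parameter model

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {model_size_gb(params, bits):.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```

Note that this counts only the weights; activations, the KV cache, and runtime overhead add to the real VRAM requirement.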

This is crucial for running large models locally. Llama 70B at 16-bit precision needs ~140GB of VRAM — impossible on consumer hardware. Quantized to 4-bit, it needs ~35GB — runnable on a Mac Studio with an M2 Ultra or a dual RTX 4090 setup.

The quality tradeoff depends on the quantization method and level. Modern quantization (GPTQ, GGUF, AWQ) is remarkably good — 4-bit models retain 95-98% of full precision quality for most tasks. Very low precision (2-3 bit) shows noticeable degradation.
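The core idea behind these methods can be shown with the simplest variant: symmetric linear quantization, where one scale factor maps floats onto the int8 range. This is a toy sketch with illustrative names; real schemes like GPTQ and AWQ add per-group scales and error-correcting calibration on top of this idea:

```python
# Symmetric (absmax) linear quantization to int8 — the basic building block
# of modern weight quantization schemes. Names here are illustrative.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats onto the int8 range [-127, 127] with a single scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate floats from quantized values."""
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.91]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error is bounded by half a quantization step (scale / 2),
# which is why models lose only a little quality at 8-bit.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

At 4-bit the range shrinks to [-7, 7], so each step — and thus the rounding error — is much larger, which is why very low precision needs the more sophisticated calibration these methods provide.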

Real-World Example

Running a 4-bit quantized version of Llama 70B on your MacBook gives you 95%+ of the full model's quality at a fraction of the memory requirement.

FAQ

What is Quantization?

Reducing the precision of AI model weights (from 32-bit to 8-bit or 4-bit) to make models smaller and faster while sacrificing some quality.

How is Quantization used in practice?

Running a 4-bit quantized version of Llama 70B on your MacBook gives you 95%+ of the full model's quality at a fraction of the memory requirement.

What concepts are related to Quantization?

Key related concepts include VRAM (Video RAM), GPU (Graphics Processing Unit), Parameters, Self-hosting, Inference. Understanding these together gives a more complete picture of how Quantization fits into the AI landscape.