Quantization
Reducing the precision of AI model weights (from 32-bit to 8-bit or 4-bit) to make models smaller and faster while sacrificing some quality.
Quantization compresses AI models by using less precise numbers. A model stored in full precision (FP32) uses 32 bits per parameter. Quantizing to INT8 uses 8 bits (4x smaller), and INT4 uses 4 bits (8x smaller). The model becomes smaller, loads faster, and runs with less VRAM.
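The arithmetic can be sketched in a few lines of NumPy. This is a simplified symmetric per-tensor INT8 scheme for illustration, not any particular library's implementation:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric INT8 quantization: map floats onto the integer range [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the stored integers."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=1024).astype(np.float32)  # toy "weight tensor"

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"storage: {w.nbytes} bytes -> {q.nbytes} bytes")  # 4096 -> 1024 (4x smaller)
print(f"max abs error: {np.abs(w - w_hat).max():.6f}")
```

Only the INT8 values and one scale factor per tensor are stored; the rounding error is bounded by half the scale, which is why quality loss stays small.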
This is crucial for running large models locally. Llama 70B in 16-bit precision needs ~140GB of VRAM, far beyond consumer hardware. Quantized to 4-bit it needs ~35GB, which fits on an M2 Ultra Mac Studio or a dual RTX 4090 setup.
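Those figures follow from simple arithmetic on weights alone (ignoring KV cache and activation overhead). The helper below is illustrative, not a real API:

```python
def model_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB.
    Excludes KV cache, activations, and framework overhead."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

for bits in (16, 8, 4):
    print(f"Llama 70B at {bits}-bit: ~{model_memory_gb(70, bits):.0f} GB")
# Llama 70B at 16-bit: ~140 GB
# Llama 70B at 8-bit: ~70 GB
# Llama 70B at 4-bit: ~35 GB
```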
The quality tradeoff depends on the quantization method and level. Modern approaches (GPTQ, AWQ, and the quantization schemes in GGUF) are remarkably good: 4-bit models retain roughly 95-98% of full-precision quality on most tasks. Very low precision (2-3 bit) shows noticeable degradation.
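One reason modern methods hold up so well at 4-bit is that they store one scale per small group of weights rather than per tensor, so a single outlier weight cannot inflate the scale for the whole layer. A simplified sketch of that block-scaling idea (the `quantize_4bit` helper is hypothetical, loosely inspired by block-quantized formats, not an actual GGUF or GPTQ routine):

```python
import numpy as np

def quantize_4bit(w: np.ndarray, group_size: int) -> np.ndarray:
    """Group-wise symmetric 4-bit quantization: one scale per group of
    `group_size` weights. Returns the dequantized approximation."""
    g = w.reshape(-1, group_size)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0  # INT4 symmetric range [-7, 7]
    q = np.round(g / scale).clip(-7, 7)
    return (q * scale).ravel()

rng = np.random.default_rng(1)
w = rng.normal(0, 0.02, 4096).astype(np.float32)
w[0] = 0.5  # a single outlier weight, common in real LLM layers

err_tensor = np.abs(w - quantize_4bit(w, group_size=4096)).mean()  # one scale overall
err_group = np.abs(w - quantize_4bit(w, group_size=32)).mean()     # one scale per 32

print(f"per-tensor mean error: {err_tensor:.5f}")
print(f"per-group  mean error: {err_group:.5f}")  # the grouped error is much smaller
```

With a single per-tensor scale, the outlier stretches the quantization grid so far that the many small weights all collapse toward zero; per-group scales confine that damage to one group of 32.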
Real-World Example
Running a 4-bit quantized version of Llama 70B on a MacBook with 64GB of unified memory gives you 95%+ of the full model's quality at a fraction of the memory requirement.
FAQ
What is Quantization?
Reducing the precision of AI model weights (from 32-bit to 8-bit or 4-bit) to make models smaller and faster while sacrificing some quality.
How is Quantization used in practice?
Running a 4-bit quantized version of Llama 70B on a MacBook with 64GB of unified memory gives you 95%+ of the full model's quality at a fraction of the memory requirement.
What concepts are related to Quantization?
Key related concepts include VRAM (Video RAM), GPU (Graphics Processing Unit), Parameters, Self-hosting, and Inference. Understanding these together gives a more complete picture of how Quantization fits into the AI landscape.