Glossary
What is LLM Quantization?
Reducing model precision to improve efficiency while maintaining quality.
What is quantization?
Quantization reduces the precision of model weights from 32-bit or 16-bit floating point to lower precision formats like INT8 or INT4. This shrinks model size and speeds up inference while maintaining most of the original performance.
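The idea above can be sketched in a few lines. This is a minimal symmetric per-tensor INT8 scheme in NumPy, not any particular library's implementation: pick one scale so the largest weight maps to 127, round, and store the 8-bit integers plus the scale.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = np.abs(weights).max() / 127.0  # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the INT8 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# INT8 storage is 4x smaller than FP32; per-weight rounding error
# is bounded by half the scale
max_err = np.abs(w - w_hat).max()
```

Real quantizers refine this basic recipe (per-channel or per-group scales, asymmetric zero-points, calibration data), but the size/precision trade-off is the same.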
Common Quantization Methods
- GPTQ: One-shot post-training quantization that uses approximate second-order information to minimize error
- AWQ: Activation-aware weight quantization that protects the most salient weights
- GGUF: File format with built-in quantization schemes, used by llama.cpp for CPU and edge inference
- bitsandbytes: On-the-fly 8-bit and 4-bit quantization (LLM.int8(), NF4) for inference and QLoRA fine-tuning
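A common thread in the formats above is group-wise quantization: instead of one scale per tensor, each small group of weights gets its own scale, which limits how much a single outlier distorts its neighbors. The sketch below is a toy illustration of that idea in NumPy, not the actual GPTQ or AWQ algorithm (both add calibration and error-correction steps on top):

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 32, bits: int = 4):
    """Toy group-wise symmetric quantization: one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for 4-bit symmetric
    flat = w.reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(flat / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_groupwise(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Rescale each group and restore the original tensor shape."""
    return (q * scales).astype(np.float32).reshape(shape)

w = np.random.randn(8, 32).astype(np.float32)
q, scales = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scales, w.shape)
```

Smaller groups track outliers more closely at the cost of storing more scales; production formats typically use group sizes of 32-128.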
Why Monitor Quantized Models
Quantization can introduce subtle quality degradation:
- Hallucination rates may increase on edge cases
- Reasoning quality can degrade on complex, multi-step tasks
To catch these regressions:
- Compare quantized and full-precision outputs on the same prompts
- Track quality metrics over time to confirm they stay within acceptable bounds
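The comparison step above can be as simple as scoring paired outputs from the two model variants. This is a minimal sketch using exact-match rate as the metric; the function name and metric choice are illustrative, and in practice you would use task-specific metrics or an LLM judge rather than string equality:

```python
def compare_outputs(full_precision: list, quantized: list) -> dict:
    """Compare paired outputs from full-precision and quantized models.

    Exact match is a crude proxy for agreement; swap in a task metric
    or semantic-similarity score for real evaluations.
    """
    assert len(full_precision) == len(quantized), "outputs must be paired"
    matches = sum(a == b for a, b in zip(full_precision, quantized))
    return {
        "samples": len(full_precision),
        "exact_match_rate": matches / len(full_precision),
    }

report = compare_outputs(
    ["Paris", "4", "blue"],  # full-precision answers (hypothetical)
    ["Paris", "4", "teal"],  # quantized answers, with one drifted response
)
```

Running this comparison on a fixed evaluation set before and after deploying a quantized model gives a concrete number to alert on.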
Does quantization affect quality?
Yes, but often minimally. Well-quantized models retain 95-99% of original quality. However, edge cases may degrade more. Monitor quantized models in production to catch quality regressions.
Monitor quantized model quality
Start Free