What is LLM Quantization?

Reducing model precision to improve efficiency while maintaining quality.

What is quantization?

Quantization reduces the precision of model weights from 32-bit or 16-bit floating point to lower precision formats like INT8 or INT4. This shrinks model size and speeds up inference while maintaining most of the original performance.
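The core idea can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor INT8 quantization; real methods (GPTQ, AWQ, bitsandbytes) use more sophisticated per-channel or group-wise schemes, and the function names here are illustrative.

```python
# Minimal sketch of symmetric per-tensor INT8 quantization (illustrative only).

def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] using one scale factor."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integer values."""
    return [q * scale for q in quantized]

weights = [0.12, -0.53, 0.98, -0.07]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
# Rounding error per weight is bounded by half the scale step.
max_error = max(abs(w - r) for w, r in zip(weights, recovered))
```

Each INT8 value takes a quarter of the storage of a 32-bit float, which is where the size and bandwidth savings come from; the small rounding error is why quality can drift.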

Common Quantization Methods

  • GPTQ: Post-training quantization for GPT-style models
  • AWQ: Activation-aware weight quantization
  • GGUF: File format used by llama.cpp for local inference on CPU, with optional GPU offload
  • bitsandbytes: On-the-fly 8-bit/4-bit quantization applied at model load time, also used for QLoRA fine-tuning
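As one concrete example, bitsandbytes quantization is usually driven through a configuration object in Hugging Face Transformers. This is a configuration sketch, not runnable standalone: it assumes a GPU plus the transformers and bitsandbytes packages, and the model id is a placeholder.

```python
# Configuration sketch: loading a model in 4-bit via bitsandbytes.
# Requires transformers + bitsandbytes and a CUDA GPU; model id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls at runtime
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder model id
    quantization_config=bnb_config,
)
```

Note that weights are stored in 4-bit but computation still happens in the chosen compute dtype, which is why quality loss is usually small.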

Why Monitor Quantized Models

Quantization can introduce subtle quality degradation:

  • Hallucination rates may increase on edge cases
  • Reasoning quality can degrade on complex, multi-step tasks

To catch these regressions:

  • Compare quantized and full-precision outputs on the same prompts
  • Track quality metrics over time to confirm degradation stays within acceptable bounds
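A comparison like this can start very simply, for example by scoring a quantized model's outputs against full-precision baselines on the same prompts. The `token_agreement` function and sample strings below are hypothetical, not part of any real monitoring API.

```python
# Hypothetical monitoring sketch: compare quantized vs. full-precision outputs.

def token_agreement(reference: str, candidate: str) -> float:
    """Fraction of positions where whitespace-split tokens match."""
    ref, cand = reference.split(), candidate.split()
    if not ref and not cand:
        return 1.0
    matches = sum(r == c for r, c in zip(ref, cand))
    return matches / max(len(ref), len(cand))

baseline  = "The capital of France is Paris"   # full-precision output
quantized = "The capital of France is Paris"   # quantized output, identical
drifted   = "The capital of France is Lyon"    # quantized output that drifted

exact = token_agreement(baseline, quantized)   # 1.0
drift = token_agreement(baseline, drifted)     # 5 of 6 tokens match
```

In practice you would replace string equality with task-level metrics (factuality checks, pass rates, semantic similarity), but the pattern of running both models on a shared prompt set is the same.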

Does quantization affect quality?

Yes, but often minimally. A well-quantized model typically retains 95-99% of the original model's quality on standard benchmarks. However, edge cases can degrade more sharply than aggregate scores suggest, so monitor quantized models in production to catch quality regressions.
