Quantization

Quantization is a model-compression technique that represents weights, activations, or computation with lower-precision numeric formats. For example, model weights stored as 16-bit floating-point values may be converted to 8-bit or 4-bit integers.
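As a rough illustration of that conversion, the sketch below maps a handful of floating-point weights to signed integers using one shared scale (symmetric, per-tensor quantization) and then maps them back. The function names and the sample weights are made up for this example; production libraries use more elaborate schemes.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using a single shared scale."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the stored scale."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.03, 0.88, -0.51]      # hypothetical weights

q8, s8 = quantize_symmetric(weights, num_bits=8)
q4, s4 = quantize_symmetric(weights, num_bits=4)

# Largest round-trip error at each bit width.
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))

print("8-bit:", q8, "max error:", err8)
print("4-bit:", q4, "max error:", err4)
```

With fewer bits the integer grid is coarser, so the round-trip error grows. That error is what the rest of this entry calls quantization error.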

Lower precision reduces memory requirements and memory bandwidth, which can make inference faster and allow larger models to run on less expensive hardware. The actual speed improvement depends on whether the deployment hardware and runtime have optimized kernels for the chosen format.
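The memory saving can be estimated with simple arithmetic. The sketch below assumes a hypothetical model with 7 billion parameters and counts weight storage only. It ignores activations, the key-value cache, and the scale metadata that quantized formats carry, so real footprints will be somewhat larger.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB (weights only, no metadata)."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 7e9  # assumed model size, for illustration only

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights take about 13.0 GiB at 16 bits, 6.5 GiB at 8 bits, and 3.3 GiB at 4 bits.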

Two broad approaches are:

Post-Training Quantization (PTQ): quantize an already trained model, often using a calibration dataset.
Quantization-Aware Training (QAT): simulate lower precision during training or fine-tuning so the model adapts to quantization error.
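The two approaches can be sketched in a few lines. In the PTQ-style step, a scale is chosen from the value range observed on calibration data. In the QAT-style step, a value is rounded to the integer grid and immediately converted back to float ("fake quantization") so that the forward pass sees the rounding error. The names and numbers below are illustrative assumptions. A real QAT setup also needs a gradient rule for the rounding step, commonly a straight-through estimator, which is not shown here.

```python
def calibrate_scale(calibration_batches, num_bits=8):
    """PTQ-style calibration: derive a scale from the largest magnitude seen."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(x) for batch in calibration_batches for x in batch)
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(x, scale, num_bits=8):
    """QAT-style fake quantization: snap to the integer grid, return a float."""
    qmax = 2 ** (num_bits - 1) - 1
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


# Hypothetical activation values collected from two calibration batches.
calibration_batches = [[0.1, -0.8, 0.5], [2.0, -1.5, 0.3]]
scale = calibrate_scale(calibration_batches)

inside = fake_quantize(0.5, scale)    # within the calibrated range: small rounding error
clipped = fake_quantize(3.0, scale)   # outside the calibrated range: clamped to about 2.0

print(scale, inside, clipped)
```

The clamped value shows why the calibration dataset matters: inputs outside the calibrated range are saturated rather than rounded.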

Methods also differ in whether they quantize weights only or both weights and activations, and whether scaling parameters are applied per tensor, channel, or smaller group of values. More granular scaling usually preserves accuracy better but adds metadata and implementation complexity.
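The effect of scaling granularity can be seen on a tiny example. The sketch below quantizes a hypothetical two-row weight matrix to 8 bits, once with a single per-tensor scale and once with one scale per row (standing in for per-channel scaling), and reports the largest round-trip error in each row. The matrix values are invented so that one row is much smaller in magnitude than the other.

```python
QMAX = 127  # largest magnitude for signed 8-bit symmetric quantization


def quantize_row(row, scale):
    """Quantize one row with the given scale and convert back to floats."""
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in row]


def row_errors(matrix, scales):
    """Largest absolute round-trip error in each row."""
    return [
        max(abs(v - r) for v, r in zip(row, quantize_row(row, s)))
        for row, s in zip(matrix, scales)
    ]


# Hypothetical weights: a small-magnitude row next to a large-magnitude row.
weights = [
    [0.010, -0.020, 0.015],
    [5.0, -3.0, 4.0],
]

tensor_scale = max(abs(v) for row in weights for v in row) / QMAX
per_tensor = [tensor_scale] * len(weights)                     # one scale stored
per_channel = [max(abs(v) for v in row) / QMAX for row in weights]  # one scale per row

errs_tensor = row_errors(weights, per_tensor)
errs_channel = row_errors(weights, per_channel)

print("per-tensor errors: ", errs_tensor)
print("per-channel errors:", errs_channel)
```

With a single scale, the large row dictates the step size and the small row is rounded almost entirely to zero or one step. Per-row scales fix that at the cost of storing one scale per row instead of one for the whole tensor.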

Quantization is not the same as model distillation. Quantization changes numeric precision, while distillation trains a student model to imitate a teacher. The methods can be combined.

Aggressive quantization may degrade perplexity, factual accuracy, reasoning, tool use, or performance on rare inputs. Aggregate benchmarks can hide regressions in specific languages or domains, so the quantized artifact should be evaluated directly on its intended workload.
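A per-slice comparison makes hidden regressions visible. The sketch below uses made-up evaluation counts for two slices, not measurements from any real model, and flags any slice whose accuracy drops by more than an assumed threshold of 5 percentage points.

```python
# Hypothetical results: (slice, examples, baseline correct, quantized correct)
results = [
    ("english", 900, 810, 808),
    ("swahili", 100, 70, 55),
]

DROP_THRESHOLD = 0.05  # assumed tolerance; choose one that fits the workload

total = sum(n for _, n, _, _ in results)
baseline_correct = sum(b for _, _, b, _ in results)
quantized_correct = sum(q for _, _, _, q in results)

# Accuracy drop over the whole evaluation set.
aggregate_drop = (baseline_correct - quantized_correct) / total

# Accuracy drop within each slice.
slice_drops = {name: b / n - q / n for name, n, b, q in results}
flagged = [name for name, drop in slice_drops.items() if drop > DROP_THRESHOLD]

print(f"aggregate drop: {aggregate_drop:.3f}")
for name, drop in slice_drops.items():
    print(f"{name}: drop {drop:.3f}")
print("flagged slices:", flagged)
```

In this toy data the aggregate drop is under 2 percentage points, while the smaller slice loses 15 points, which only the per-slice view reveals.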

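Quantization is particularly important for open-weights models and local Small Language Models (SLMs), which are often run on consumer GPUs, laptops, or edge devices where available memory is the main constraint.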

The GPTQ paper (Frantar et al., 2022) introduced an influential post-training, weight-only method that quantizes transformer weights to 3 or 4 bits using approximate second-order information.
