Model Distillation

Model distillation is a training technique that transfers behavior from a capable teacher model to a smaller or more efficient student model. The student learns from teacher-generated probabilities, labels, outputs, reasoning traces, or synthetic data rather than relying only on the original training dataset.

Distillation can target general language capability or a narrower task. A common pipeline uses the teacher to generate answers for a curated set of prompts, filters those answers for quality, and then applies Supervised Fine-Tuning (SFT) to the student.
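The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `build_sft_dataset`, `toy_teacher`, and `non_empty` are hypothetical names, and the toy teacher stands in for a real model call.

```python
from typing import Callable


def build_sft_dataset(
    prompts: list[str],
    teacher: Callable[[str], str],
    keep: Callable[[str, str], bool],
) -> list[dict[str, str]]:
    """Collect teacher answers and keep only those that pass the quality filter."""
    dataset = []
    for prompt in prompts:
        answer = teacher(prompt)      # step 1: the teacher generates an answer
        if keep(prompt, answer):      # step 2: filter the answer for quality
            dataset.append({"prompt": prompt, "completion": answer})
    return dataset                    # step 3: pass this to an SFT trainer


# Toy stand-ins (assumptions) so the sketch runs without a real model.
def toy_teacher(prompt: str) -> str:
    return "" if "unanswerable" in prompt else f"Answer to: {prompt}"


def non_empty(prompt: str, answer: str) -> bool:
    return len(answer.strip()) > 0


sft_data = build_sft_dataset(
    ["What is distillation?", "An unanswerable prompt"],
    toy_teacher,
    non_empty,
)
```

In practice the filter is where most of the effort goes: it might check format, correctness against a reference, or agreement between several teacher samples.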

Several forms are used:

Response distillation: train on final teacher answers.
Logit distillation: match the teacher's output probability distribution.
Feature distillation: align internal representations.
Reasoning distillation: train on intermediate solution traces or explanations.
On-policy distillation: have the teacher correct or score outputs sampled from the student itself.
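To make the logit distillation entry above concrete, here is a small sketch of one commonly used objective: the KL divergence between temperature-softened teacher and student distributions. The function names and the default temperature are assumptions chosen for illustration, and real training code would compute this on tensors inside a deep learning framework.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(
    teacher_logits: list[float],
    student_logits: list[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions.

    The result is multiplied by temperature**2, a common convention that
    keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [2.0, 1.0, 0.1]
loss_same = distillation_kl(teacher, teacher)          # identical distributions
loss_diff = distillation_kl(teacher, [0.1, 1.0, 2.0])  # mismatched student
```

The loss is zero when the student matches the teacher and grows as the two distributions diverge. It is often mixed with an ordinary cross-entropy term on ground-truth labels.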

The objective is usually to reduce inference cost, latency, memory use, or deployment requirements while retaining important capability. Distillation is often combined with quantization when deploying a Small Language Model (SLM).

A student does not automatically inherit only the teacher's strengths. It can also reproduce biases, factual errors, unsafe behavior, and stylistic artifacts. Teacher-generated datasets require provenance, filtering, contamination checks, and independent evaluation.
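One of the checks mentioned above, contamination screening, can be approximated with a simple word n-gram overlap test against held-out benchmark items. The sketch below is deliberately crude: `ngrams`, `is_contaminated`, and the window size of 5 are assumptions for illustration, and production pipelines typically use proper tokenization and normalization.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(sample: str, benchmark_items: list[str], n: int = 5) -> bool:
    """Flag a sample that shares at least one n-gram with any benchmark item."""
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(item, n) for item in benchmark_items)


benchmark = ["What is the capital of France and when was it founded"]
leaky = is_contaminated("Q: what is the capital of France and why", benchmark)
clean = is_contaminated("Explain how distillation reduces inference cost", benchmark)
```

Flagged samples would normally be dropped from the teacher-generated dataset before training so that later evaluation on the benchmark stays independent.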

Distillation differs from compression methods such as quantization, which only change how existing weights are represented numerically. Distillation changes model parameters through training and can specialize the student to a particular behavior or domain.

The Google Research paper "Distilling Step-by-Step" shows that teacher-generated rationales can serve as additional supervision when training smaller language models.
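One way to use rationales as additional supervision is to train the student on two tasks per input: predicting the label and generating the rationale, with the two losses combined by a weight. The sketch below is loosely in that spirit and is not the paper's code; the prefix strings, function names, and default weight are illustrative assumptions.

```python
def make_multitask_examples(
    question: str, rationale: str, label: str
) -> list[dict[str, str]]:
    """Build one label-prediction and one rationale-generation example."""
    return [
        {"input": f"[label] {question}", "target": label},
        {"input": f"[rationale] {question}", "target": rationale},
    ]


def combined_loss(
    label_loss: float, rationale_loss: float, rationale_weight: float = 1.0
) -> float:
    """Weighted sum of the two task losses (the weight is a hyperparameter)."""
    return label_loss + rationale_weight * rationale_loss


examples = make_multitask_examples(
    question="Is 17 a prime number?",
    rationale="17 has no divisors other than 1 and itself.",
    label="yes",
)
total = combined_loss(label_loss=0.5, rationale_loss=0.25, rationale_weight=2.0)
```

Because the rationale is a training target rather than an input, the student can still answer with the label alone at inference time.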
