What is Model Distillation?

TL;DR

Model distillation transfers knowledge from a large 'teacher' model into a smaller 'student' model, slashing inference cost and latency while preserving most of the quality.

Model Distillation: Definition & Explanation

Model Distillation, introduced by Hinton et al. (2015), trains a small 'student' model to imitate the output distribution of a large 'teacher' model, preserving most of the teacher's performance while dramatically cutting inference cost, latency, and GPU memory. By 2026 it is a core production strategy: Claude Haiku, GPT-5 mini/nano, Gemini 3 Flash, Llama 3.3/4, and Mistral Small/Tiny all leverage distillation.

There are three main approach families:

1. Response-based: the student matches the teacher's output logits.
2. Feature-based: the student mirrors the teacher's intermediate representations.
3. Relation-based: the student preserves relationships between samples (for example, pairwise similarities) captured by the teacher.

Classic examples include DistilBERT (roughly 40% fewer parameters while retaining about 97% of BERT's performance), TinyBERT, and MiniLM. In the LLM era, 'synthetic data distillation', in which the teacher generates large labeled datasets that are then used to train the student, has become the dominant approach. OpenAI's distillation API, Anthropic's Claude Haiku 4.5, and Together AI's distillation services routinely deliver 5-10x cost reductions in production. Distillation is essential for edge/mobile deployment, real-time inference, and cost-optimized workloads.
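To make the response-based family concrete, the sketch below shows a minimal Hinton-style distillation loss in PyTorch, assuming a standard classification setup: the student is trained to match the teacher's temperature-softened output distribution in addition to the usual hard-label cross-entropy. The function name and the default temperature T and mixing weight alpha are illustrative assumptions, not a fixed recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Response-based (Hinton-style) distillation loss sketch.

    Blends a soft-target term (student matches the teacher's
    temperature-softened distribution) with the usual hard-label
    cross-entropy. T and alpha are tuning knobs, not fixed values.
    """
    # Soft targets: KL divergence between the softened distributions.
    # Scaling by T*T keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)

    return alpha * soft + (1.0 - alpha) * hard
```

In practice T and alpha are tuned per task; higher temperatures expose more of the teacher's relative class probabilities (its 'dark knowledge'), which is where much of the transferred signal comes from.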
