Distillation — From Teacher to Student
When you go to use a large language model (LLM) and see how many variants it has, are you confused about which one to pick? Let’s take DeepSeek-R1 as an example. Its full-size base model has 671B parameters, yet there is a long list of “distill” versions ranging from 1.5B to 70B parameters, as shown below:
- DeepSeek-R1-Distill-Qwen-1.5B
- DeepSeek-R1-Distill-Qwen-7B
- DeepSeek-R1-Distill-Llama-8B
- DeepSeek-R1-Distill-Qwen-14B
- DeepSeek-R1-Distill-Qwen-32B
- DeepSeek-R1-Distill-Llama-70B
Similarly, almost all LLMs have distilled versions (e.g., BERT vs. DistilBERT, GPT-2 vs. DistilGPT2, RoBERTa vs. DistilRoBERTa, Whisper vs. Distil-Whisper; see the Hugging Face model repository).
Why do the full-size models have distilled versions? More importantly, what is distillation, and how does it work? This post answers those questions. After reading this post, you will realize:
- Distilled versions exist for fast inference (i.e., for running the model cheaply).
- Distillation transfers knowledge from a complex (teacher) model to a smaller (student) model.
- Distillation may not be entirely new to you; you may have already done it on non-LLM projects (see the sketch after this list).
- You will build a complex GBM and…
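As a preview of that non-LLM case, here is a minimal sketch of distillation with scikit-learn: a large gradient boosting “teacher” is trained on hard labels, and a small decision-tree “student” is then fit on the teacher’s predicted probabilities (soft targets). The dataset, model choices, and hyperparameters here are illustrative assumptions, not necessarily the exact setup built later in this post.

```python
# Minimal distillation sketch (assumed setup): GBM teacher -> small tree student.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for a real tabular problem.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Teacher: a large, slow-to-evaluate GBM trained on the hard labels.
teacher = GradientBoostingClassifier(n_estimators=500, max_depth=4, random_state=0)
teacher.fit(X_train, y_train)

# Soft targets: the teacher's predicted probability of the positive class.
soft_targets = teacher.predict_proba(X_train)[:, 1]

# Student: a much smaller model fit to mimic the teacher's soft targets.
# It is a regressor because it learns probabilities, not class labels.
student = DecisionTreeRegressor(max_depth=5, random_state=0)
student.fit(X_train, soft_targets)

# The student mimics the teacher's probabilities; threshold at 0.5 for labels.
student_labels = (student.predict(X_test) > 0.5).astype(int)
print("Teacher accuracy:", teacher.score(X_test, y_test))
print("Student accuracy:", (student_labels == y_test).mean())
```

The student is far cheaper to evaluate than the 500-tree teacher, which is exactly the trade-off distilled LLMs make at a much larger scale.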