
Distillation — From Teacher to Student

Chris Kuo/Dr. Dataman
12 min read · Mar 13, 2025


Figure (1): The Master Teacher and the Student

When you look up a large language model (LLM) and see all of its variations, are you confused about which one to use? Let’s take DeepSeek-R1 as an example. Its full-size base model has 671B parameters, and there is a long list of “distill” versions ranging from 1.5B to 70B parameters, as shown below (a short loading example follows the list):

  • DeepSeek-R1-Distill-Qwen-1.5B
  • DeepSeek-R1-Distill-Qwen-7B
  • DeepSeek-R1-Distill-Llama-8B
  • DeepSeek-R1-Distill-Qwen-14B
  • DeepSeek-R1-Distill-Qwen-32B
  • DeepSeek-R1-Distill-Llama-70B
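
To make the list concrete, here is a minimal sketch of loading the smallest distilled checkpoint with the Hugging Face transformers library. The repository id is assumed to match the listing on the Hugging Face Hub, and the prompt is only an illustration; swap in a larger variant if your hardware allows.

```python
# Minimal sketch: load a distilled checkpoint with Hugging Face transformers.
# Assumes the repo id "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # smallest distilled variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A distilled model is used exactly like the full model -- only smaller and faster.
inputs = tokenizer("What is knowledge distillation?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```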

Similarly, almost all LLMs have distilled versions (e.g., BERT vs. DistilBERT, GPT-2 vs. DistilGPT2, RoBERTa vs. DistilRoBERTa, Whisper vs. Distil-Whisper; see the Hugging Face model repository).

Why do the full models have distilled versions? More importantly, what is distillation, and how does it work? This post answers those questions. After reading it, you will see that:

  1. Distillation exists for fast inference (so the model is cheaper to use).
  2. Distillation transfers knowledge from a complex model (the teacher) to a smaller model (the student), as sketched in the code after this list.
  3. Distillation may not be entirely new to you. You may have done it before on non-LLM projects.
  4. You will build a complex GBM and…
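
Before diving in, here is a minimal sketch of what “from a complex model to a smaller model” means in code, assuming PyTorch. The `teacher` and `student` logits and the `distillation_loss` helper are illustrative names, not part of any specific library: the student is trained to match the teacher’s temperature-softened output distribution while still fitting the true labels.

```python
# Minimal distillation sketch (assumes PyTorch). The student learns from the
# teacher's softened outputs (soft targets) and from the true labels (hard targets).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target loss (match the teacher) and hard-target loss (match the labels)."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    # Hard targets: ordinary cross-entropy on the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage: random tensors standing in for a batch of 4 examples, 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)          # produced by the frozen teacher
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```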
