Four Quantization Techniques for Large Language Models

Chris Kuo/Dr. Dataman
12 min read · Mar 6, 2025


Large language models (LLMs) are indeed large. Training and using them is still a barrier for individuals and startup companies, so reducing their memory and computation requirements is an important goal. How do we do that? Today, I will walk you through several kinds of quantization techniques. Above all, I sketched the triangles above for you to take away:

  • For fast deployment, use Post-Training Quantization.
  • For high accuracy, use Quantization-Aware Training.
  • For limited memory, use 4-bit Quantization Fine-Tuning.
  • For a mix of these goals, use Mixed Precision.

The essence of all quantization techniques is to convert high-precision numerical values into lower-precision representations.
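To make that concrete, here is a minimal sketch of the idea in NumPy (my own illustration, not code from any particular library): each FP32 value is mapped to an 8-bit integer with a scale and a zero point, and mapped back with a small rounding error. Production libraries handle calibration, per-channel scales, and saturation far more carefully.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Quantize a float32 array to int8 with a per-tensor affine mapping."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = max((x_max - x_min) / 255.0, 1e-8)   # int8 spans 256 levels
    zero_point = round(-x_min / scale) - 128      # shift so x_min maps to -128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float32 values from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
approx = dequantize_int8(q, scale, zp)
print("max absolute error:", np.abs(weights - approx).max())
```

The storage drops from 32 bits to 8 bits per value; the price is the small reconstruction error printed at the end.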

Just in case you want to understand the foundation of LLMs better, you may be interested in my article titled “Explain the Transformer to a Smart Freshman”.

Let’s talk about saving first.

Money Talk First

The ongoing usage cost of an LLM (inference) is an important factor for a user. For example, the 13-billion-parameter LLaMA 2 model has the following size:

  • Full size (FP16): ~26 GB (gigabytes)
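The arithmetic behind that number is simple: 13 billion parameters × 2 bytes per FP16 parameter ≈ 26 GB. The sketch below is my own back-of-the-envelope helper (not from any library) showing how the weight footprint shrinks at lower precisions; real deployments also need memory for activations, the KV cache, and framework overhead, so treat these as lower bounds.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N = 13e9  # LLaMA 2 13B
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_memory_gb(N, bits):.1f} GB")
# FP16: ~26.0 GB, INT8: ~13.0 GB, INT4: ~6.5 GB
```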
