Four Quantization Techniques for Large Language Models
Large language models (LLMs) are indeed large. Training and using them remains a barrier for individuals and startups, so reducing their memory and computation requirements is an important goal. How do we do that? Today, I will walk you through four kinds of quantization techniques. Most of all, I sketched the triangles above for you to take away:
- For fast deployment, use post-training quantization.
- For high accuracy, use quantization-aware training.
- For limited memory, use 4-bit quantization fine-tuning.
- For a mix of these goals, use mixed precision.
The essence of all quantization techniques is to convert high-precision numerical values into lower-precision representations.
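To make that idea concrete, here is a minimal sketch of symmetric linear quantization from FP32 to INT8 in plain NumPy. It is not the exact recipe of any particular library or of the techniques above; the function names `quantize_int8` and `dequantize` are my own illustrative choices.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric linear quantization of FP32 weights to INT8.

    Maps the range [-max|w|, +max|w|] onto the integer range [-127, 127].
    """
    scale = np.max(np.abs(weights)) / 127.0          # one scale factor per tensor
    q = np.round(weights / scale).astype(np.int8)    # low-precision representation
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original FP32 values."""
    return q.astype(np.float32) * scale

# Toy example: quantize a small random weight tensor and measure the error
w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max quantization error:", np.max(np.abs(w - w_hat)))
```

Each INT8 weight takes 1 byte instead of the 4 bytes of FP32 (or 2 bytes of FP16), at the cost of a small rounding error like the one printed above.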
In case you want to understand the foundations of LLMs better, you might enjoy this article titled “Explain the Transformer to a Smart Freshman”.
Let’s talk about the savings first.
Money Talk First
The ongoing usage cost of an LLM (inference) is an important factor for a user. For example, for the 13-billion-parameter LLaMA 2, the model size is:
- Full size (FP16): ~26 GB (gigabytes), i.e., 13 billion parameters × 2 bytes each