Fine-tuning a GPT — LoRA
This post explains LoRA, the proven fine-tuning method whose name abbreviates “Low-Rank Adaptation of Large Language Models”. I will walk you through the LoRA technique, its architecture, and its advantages, and present the background concepts of “low-rank” and “adaptation” to aid your understanding. As in “Fine-tune a GPT — Prefix-tuning”, I include a code example and walk you through it line by line. I will pay special attention to the GPU-intensive nature of fine-tuning a Large Language Model (LLM), and then discuss practical treatments that have been packaged as the Python libraries “bitsandbytes” and “accelerate”. A minimal sketch of the core LoRA idea follows the list below. After completing this article, you will be able to explain:
- Why do we still need fine-tuning?
- The challenges in adding more layers — Inference Latency
- What is the “rank” in the “low-rank” of LoRA?
- The architecture of LoRA
- The advantages of LoRA
- Fine-tuning is still GPU-intensive
- The techniques to reduce the use of GPUs
- The code example
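Before diving in, here is a minimal preview of the core idea, written as a standalone PyTorch sketch rather than the post's actual code example: the pretrained weight matrix is frozen, and only two small matrices A and B of rank r are trained. The class name `LoRALinear` and the hyperparameter values `r=8` and `alpha=16` are illustrative choices of mine, not fixed by LoRA itself.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    output = W x + (alpha / r) * B A x, where A is r x d_in and B is d_out x r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        d_out, d_in = base.out_features, base.in_features
        # A projects down to rank r; B projects back up. B starts at zero,
        # so the wrapped layer initially behaves exactly like the original.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(1, 768))  # same output shape as the base layer
```

To see why this is cheap: a full 768×768 linear layer has 589,824 weights, while the rank-8 update adds only 2 × 8 × 768 = 12,288 trainable parameters, about 2% of the original. The rest of the post unpacks where this low-rank structure comes from and why it works.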
Why do we still need fine-tuning?
Pretrained Large Language Models (LLMs) are already trained on many types of data for various tasks such as text summarization…