The Challenges of Training LLMs: Lots of Time and Resources
Suppose you want to train a Large Language Model (LLM), which can understand and produce human-like text. You want to ask it questions about your organization and get answers back.
The problem is that the LLM doesn't know your organization; it only knows general things. That is where techniques like finetuning, RAG, and others come in.
Training big LLMs requires a lot of resources and time, so it's a hefty task unless you have the proper machine to do the job.
Story of How We Solved The Problem of Time and Resources
Suppose we want to train the Llama 2 LLM on information about our organization, and we are using Google Colab to do it. The free version of Colab provides a single Nvidia T4 GPU with 16GB of memory.
But training the Llama 2 7-billion-parameter model requires about 28GB of memory. Roughly speaking, 7 billion 16-bit weights already take around 14GB, and keeping gradients in the same precision doubles that.
This is a problem: we can't train the model with only 16GB of memory.
To solve this, we researched optimization techniques and found LoRA, which stands for Low-Rank Adaptation of Large Language Models.
LoRA adds a layer of finetuned weights on top of the model without modifying the existing weights. This consumes less time and memory.
By using LoRA, I was able to finetune the Llama 2 model and get outputs from it on a single T4 GPU.
Refer to the above image. I asked the Llama 2 model, before finetuning, the question "How many servers does Hexmos have?" It replied that it was unable to provide the information.
After finetuning, I asked the same question, and it gave me this reply:
Hexmos has 2 servers in Azure and 4 servers in AWS
Let's see how LoRA helped me achieve this.
How LoRA Helps with Finetuning More Efficiently
Let's take a deeper dive into how LoRA works.
Large models like GPT-3 have 175 billion parameters. Parameters are numbers stored in matrices; they are like the knobs and dials the model tweaks to get better at its task. Fully finetuning all of them to our needs is a daunting task and requires a lot of computational resources.
LoRA takes a different approach to this problem: instead of finetuning the entire model, it trains a much smaller set of additional parameters.
Consider the above two boxes. One represents the weights of the existing model; the other represents our fine-tuned weights (based on our custom dataset). These are added together to form our fine-tuned model.
With this method, we don't need to change the existing weights of the model. Instead, we add our fine-tuned weights on top of the original weights, which makes it less computationally expensive.
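Here is a rough sketch of that idea as a toy NumPy example (stand-in matrices, not the actual Llama 2 weights): the base weights stay frozen, and the finetuned behaviour comes from adding a separate update matrix on top of them.

```python
import numpy as np

# Toy illustration: the pretrained weights stay untouched.
W_base = np.random.randn(5, 5)   # frozen, never updated during finetuning

# LoRA learns a separate update matrix (how it is built comes next).
delta_W = np.random.randn(5, 5)  # stands in for the learned LoRA update

# The finetuned model behaves as if its weights were W_base + delta_W.
W_finetuned = W_base + delta_W

x = np.random.randn(5)           # some input activation
y = W_finetuned @ x              # same as W_base @ x + delta_W @ x
```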
Another question may arise: how are these finetuned weights calculated?
In matrices, we have a concept called rank.
Rank, in simple words, determines the precision of the model after finetuning. If the rank is low, there is more optimization (fewer numbers to train), but you sacrifice some of the model's accuracy.
If the rank is high, the precision is higher, but there is less optimization.
The LoRA weight matrix is calculated by multiplying two smaller matrices.
For example, multiplying a 5x1 matrix by a 1x5 matrix gives a 5x5 LoRA weight matrix, while only the 10 values in the two small matrices ever need to be trained.
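A toy NumPy version of that example: the product of the two small matrices is a full 5x5 update, yet only 10 numbers (5 + 5) are trainable instead of 25.

```python
import numpy as np

rank = 1
B = np.random.randn(5, rank)  # 5x1 -> 5 trainable numbers
A = np.random.randn(rank, 5)  # 1x5 -> 5 trainable numbers

delta_W = B @ A               # 5x5 LoRA weight matrix of rank 1
print(delta_W.shape)          # (5, 5)
print(B.size + A.size, "trainable values instead of", delta_W.size)
```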
We can set the rank of these smaller matrices to choose the balance between precision and optimization.
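In practice, you rarely build these matrices by hand; libraries such as Hugging Face's peft expose the rank as a config value. Below is a minimal sketch, assuming you have access to the gated Llama 2 7B checkpoint on the Hugging Face Hub; the target_modules names are typical for Llama-style models but should be verified against your checkpoint.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumption: access to the gated meta-llama/Llama-2-7b-hf checkpoint.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=8,                                  # the rank: lower = fewer trainable parameters
    lora_alpha=16,                        # scaling factor applied to the LoRA update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (assumed names)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the 7B parameters
```

Raising `r` gives the adapter more capacity (closer to full finetuning), while lowering it keeps memory use and training time down.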
Real Life Example: Training a Llama2 Model with Custom Dataset