LoRA: Fine-Tuning Large Matrices with Low-Rank Adaptation

Introduction

Fine-tuning large deep learning models can be computationally expensive and slow. A common challenge when working with large language models (LLMs) or other neural networks is how to adapt them to a new task without updating all of their parameters, since the sheer size of their weight matrices makes full fine-tuning resource-intensive.

One effective solution to this problem is the LoRA (Low-Rank Adaptation) technique. LoRA allows for a more efficient fine-tuning process by decomposing large weight matrices into smaller, low-rank matrices. This not only reduces the number of trainable parameters but also preserves the integrity of the pretrained model.

In this blog post, we’ll break down the LoRA technique step by step, using a simple Python code example to show how it works. By the end of the article, you’ll have a clear understanding of how LoRA can be used to fine-tune large matrices efficiently.

The Problem with Fine-Tuning Large Models

When you fine-tune a large model, you typically adjust its weight matrices based on new training data. In modern neural networks, these weight matrices can be massive. For example, even a single layer may have a weight matrix \(W\) whose dimensions run into the hundreds or even thousands.

Fine-tuning this large matrix directly has two major challenges:

  1. Memory Constraints: The size of these matrices means that storing and updating them can be expensive in terms of memory.

  2. Overfitting Risk: If you update all parameters, especially when your new training dataset is small, you risk overfitting to the new data and losing the general knowledge from the pretrained model.

Enter LoRA: Fine-Tuning with Low-Rank Matrices

LoRA provides an elegant solution to these challenges by freezing the original weight matrix \(W\) and introducing two much smaller matrices, \(A\) and \(B\). The idea is that instead of updating \(W\) directly, you approximate its update as the product of these two low-rank factors.

Here’s how LoRA works:

  1. Keep the original weight matrix \(W\) fixed. This ensures that the model retains the general knowledge from pretraining.

  2. Introduce two low-rank matrices:
    • \(A\) with dimensions \((d_{\text{out}}, r)\)
    • \(B\) with dimensions \((r, d_{\text{in}})\)

    where \(r\) is a small integer representing the rank. \(r\) is chosen to be much smaller than either \(d_{\text{out}}\) or \(d_{\text{in}}\), the dimensions of the original weight matrix \(W\).

  3. Compute the low-rank update: The update to \(W\) is given by the product of these two matrices: \(\Delta W = A \times B\)

  4. Add the update to the original matrix: The final weight matrix used during training becomes: \(W_{\text{new}} = W + A \times B\)

    This allows the model to adjust its weights for the new task without needing to update all parameters in the original weight matrix.

By learning these two smaller matrices, LoRA drastically reduces the number of trainable parameters, making the fine-tuning process much more efficient.
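
To get a feel for the savings, here is a quick back-of-the-envelope calculation in Python. The \(4096 \times 4096\) layer size below is purely illustrative (loosely inspired by the hidden dimensions of large transformer models), not taken from any particular architecture:

# Back-of-the-envelope parameter count for a single layer.
# The 4096 x 4096 size is illustrative, not from a specific model.
d_out, d_in, r = 4096, 4096, 8

full_params = d_out * d_in               # fine-tuning W directly
lora_params = d_out * r + r * d_in       # fine-tuning A and B instead

print(f"Full fine-tuning: {full_params:,} parameters")    # 16,777,216
print(f"LoRA with r=8:    {lora_params:,} parameters")    # 65,536
print(f"Reduction:        {full_params // lora_params}x")  # 256x

For a square \(d \times d\) matrix, the trainable parameter count drops from \(d^2\) to \(2dr\), a reduction factor of \(d / 2r\).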

Step-by-Step LoRA Algorithm

Let’s break down the LoRA algorithm step by step:

Step 1: Initialization

  • Start with a pretrained model: We begin with a model that has already been trained on a large dataset. The weight matrices in this model, like \(W\), are already optimized for general tasks.

  • Choose a low-rank \(r\): The rank \(r\) is a small number that dictates the size of the low-rank matrices \(A\) and \(B\). For example, if \(W\) is a \(6 \times 6\) matrix, we might choose \(r = 2\).
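
How should \(A\) and \(B\) themselves be initialized? A common scheme (used in the original LoRA paper, adapted here to this post's \(A \times B\) convention) is to give one factor small random values and start the other at zero, so that \(\Delta W = 0\) and fine-tuning begins exactly at the pretrained weights. A minimal sketch:

import numpy as np

d_out, d_in, r = 6, 6, 2

# One factor gets small random values, the other starts at zero,
# so delta_W = A @ B is exactly zero before any training happens.
A = np.random.randn(d_out, r) * 0.01  # small random initialization
B = np.zeros((r, d_in))               # zero initialization

assert np.allclose(A @ B, np.zeros((d_out, d_in)))  # W_new == W at step 0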

Step 2: LoRA Decomposition

  • Decompose the update: Instead of updating \(W\) directly, we introduce two smaller matrices, \(A\) and \(B\). These factors are far smaller than \(W\), and their product has rank at most \(r\), which makes them much cheaper to store and train.

  • Freeze the original weights: The original matrix \(W\) is frozen during fine-tuning. This means its values are not changed, ensuring that the general knowledge captured during pretraining is preserved.

Step 3: Training

  • Train \(A\) and \(B\): During the fine-tuning process, only the two small matrices, \(A\) and \(B\), are trained. This allows the model to adapt to new data without requiring updates to the entire weight matrix.

  • Apply the low-rank update: The update matrix \(\Delta W = A \times B\) is added to the original matrix \(W\). This small update allows the model to fine-tune itself for the new task.
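
To make Steps 2 and 3 concrete, here is a minimal PyTorch-style sketch of a linear layer with a frozen \(W\) and trainable \(A\) and \(B\). It assumes PyTorch is installed; the name LoRALinear and the layer sizes are illustrative, not part of any library API:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with a frozen weight W plus a trainable low-rank update A @ B."""
    def __init__(self, d_in, d_out, r):
        super().__init__()
        # Pretrained weight, frozen: requires_grad=False keeps it out of gradient updates.
        self.W = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)
        # Trainable low-rank factors; B starts at zero so the update starts at zero.
        self.A = nn.Parameter(torch.randn(d_out, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, d_in))

    def forward(self, x):
        # Equivalent to multiplying by W_new = W + A @ B.
        return x @ (self.W + self.A @ self.B).T

layer = LoRALinear(d_in=6, d_out=6, r=2)
print([name for name, p in layer.named_parameters() if p.requires_grad])
# ['A', 'B'] -- only the low-rank factors are trainable

An optimizer built from only the trainable parameters (e.g. [p for p in layer.parameters() if p.requires_grad]) will then update \(A\) and \(B\) while leaving \(W\) untouched.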

Step 4: Inference

  • Inference with LoRA: At inference time, you can either compute the update \(A \times B\) on the fly or precompute the new matrix \(W_{\text{new}} = W + A \times B\). The latter option is often more efficient, as it allows you to use the updated weight matrix just like a normal layer in the model.

Do We Need \(A\) and \(B\) During Inference?

This brings up an important question: Do we need \(A\) and \(B\) during inference?

The answer depends on whether you want to compute the update matrix \(A \times B\) during inference or precompute it after training.

  1. Compute on the fly: If you compute \(A \times B\) during inference, you will need both \(A\) and \(B\) at inference time. This allows you to dynamically adjust the weights as needed.

  2. Precompute \(W_{\text{new}}\): Alternatively, you can precompute \(W_{\text{new}} = W + A \times B\) after training is complete. In this case, you no longer need \(A\) and \(B\) during inference, as you’re using the updated weight matrix directly. This method is computationally more efficient during inference.
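
The following short NumPy check (with illustrative sizes matching the example below) confirms that the two options produce identical outputs:

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 6))
A = rng.standard_normal((6, 2))
B = rng.standard_normal((2, 6))
x = rng.standard_normal(6)  # a single input vector

# Option 1: keep A and B and apply the update on the fly.
y_on_the_fly = W @ x + A @ (B @ x)

# Option 2: merge once after training, then discard A and B.
W_new = W + A @ B
y_merged = W_new @ x

assert np.allclose(y_on_the_fly, y_merged)  # identical outputs

Note that option 1 never materializes \(A \times B\): computing \(A(Bx)\) costs two skinny matrix-vector products, which is one reason keeping the factors separate can be attractive when you want to swap adapters without touching \(W\).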

Python Example: Applying LoRA to a Simple Matrix

Now that we’ve explained how LoRA works, let’s look at a simple Python example that demonstrates the LoRA technique using random matrices. This example shows how to update a large matrix using the product of two smaller, low-rank matrices.

import numpy as np

# Step 1: Initialize the large matrix W (e.g., a 6x6 matrix)
d_out = 6  # Output dimension
d_in = 6   # Input dimension

# Simulate the large matrix W (initialized randomly)
W = np.random.randn(d_out, d_in)
print("Original W matrix:")
print(W)

# Step 2: Initialize low-rank matrices A and B
r = 2  # Rank of the low-rank approximation

# A has dimensions (d_out, r)
A = np.random.randn(d_out, r)

# B has dimensions (r, d_in)
B = np.random.randn(r, d_in)

# Step 3: Compute the update matrix (A * B)
delta_W = np.dot(A, B)  # This gives us the low-rank update

# Step 4: Update the original matrix W
W_new = W + delta_W  # New matrix after LoRA update

print("\nUpdate matrix (A * B):")
print(delta_W)

print("\nUpdated W matrix:")
print(W_new)

Output:

Original W matrix:
[[ 0.57982682  0.56981892 -0.31648517  0.74758125 -0.31424568 -1.25113497]
 [-0.45936964 -1.21938562  0.34957394  0.38277855  0.66722387  0.71315171]
 ...
 
Update matrix (A * B):
[[ 0.25153752 -0.31847642  0.27297871  0.16230158 -0.20924836  0.45321087]
 [-0.47939022  0.38251002 -0.21547047 -0.27092739  0.32487688 -0.51983357]
 ...
 
Updated W matrix:
[[ 0.83136434  0.2513425  -0.04350646  0.90988283 -0.52349405 -0.7979241 ]
 [-0.93875986 -0.8368756   0.13410347  0.11185116  0.99210075  0.19331814]
 ...

Conclusion

The LoRA technique is a powerful and efficient way to fine-tune large models without needing to update all their parameters. By learning low-rank matrices \(A\) and \(B\), we can significantly reduce the computational and memory costs of fine-tuning while still allowing the model to adapt to new tasks.