Fine-tuning large language models (LLMs) like LLaMA 3 can unlock their full potential, tailoring their capabilities to specific tasks and applications. Google Colab provides an ideal platform for this process, offering powerful computing resources in a convenient, cloud-based environment.
In this article, we’ll guide you through the essential steps of fine-tuning LLaMA 3—or any other LLM—in Colab. Whether you're looking to enhance model performance for specific domains, improve accuracy on custom datasets, or develop unique AI applications, this tutorial will provide you with the practical knowledge and tools you need. Join us as we explore the techniques, best practices, and nuances of fine-tuning LLMs, empowering you to create highly customized models that excel in your particular area of interest.
Please refer to the article How to run Llama 3 (or) any LLM in Colab for setting up the LLM in Colab.
You can watch the video tutorial with a step-by-step explanation down below.
PEFT Model Setup
PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that adapts a large pre-trained model by training only a small number of additional parameters, such as LoRA adapters, instead of updating all of the model's weights. This keeps memory usage and training time manageable on a single Colab GPU.
Next let us see how to set up a PEFT model using the FastLanguageModel.get_peft_model() function.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = True,
    random_state = 42,
    max_seq_length = max_seq_length
)
model: This is the pre-trained language model that will be fine-tuned. In this tutorial it is the LLaMA 3 model loaded earlier with Unsloth's FastLanguageModel.
r = 16: This is a rank parameter that defines the rank of the low-rank adaptation matrices. In LoRA, instead of updating the full weight matrices in the model, low-rank matrices are introduced. A higher value of r increases the capacity of the adapters and allows for more flexible adaptation during fine-tuning.
lora_alpha = 16: This is a scaling factor for the LoRA weights. It adjusts the contribution of the low-rank matrices during the forward pass. A higher lora_alpha increases the impact of these adapted weights.
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]: These are the specific layers within the transformer model that will be fine-tuned using LoRA. In a transformer architecture:
q_proj, k_proj, v_proj refer to the query, key, and value projection matrices in the self-attention mechanism.
o_proj refers to the output projection matrix.
gate_proj, up_proj, down_proj refer to the projection layers of the gated feed-forward (MLP) block used in LLaMA-style transformer architectures.
lora_dropout = 0: This sets the dropout rate for the LoRA layers to 0, meaning no dropout is applied. Dropout is a regularization technique that randomly deactivates some neurons during training to prevent overfitting. Here, it is disabled.
bias = "none": This indicates how bias terms in the model are treated. "none" means that no bias terms are being fine-tuned. In some cases, fine-tuning biases along with weights can improve model performance, but it is skipped here.
use_gradient_checkpointing = True: Enabling gradient checkpointing reduces memory usage by recomputing some intermediate results during backpropagation instead of storing them. This is particularly useful when fine-tuning large models, as it allows the model to use less GPU memory at the cost of slightly longer training times.
random_state = 42: This sets the random seed to 42, ensuring that any randomness (such as random weight initialization or dropout) is consistent across different runs. It ensures reproducibility of results.
max_seq_length = max_seq_length: This specifies the maximum sequence length that the model can process. It defines the maximum number of tokens the model can handle in a single input instance. This is crucial for determining how much text can be input during training or inference.
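As a quick sanity check, you can print how many parameters LoRA will actually train compared with the full model. This assumes the object returned by get_peft_model() follows the standard PEFT PeftModel API, which exposes print_trainable_parameters():

# Report the number of trainable (LoRA) parameters versus the total parameter count;
# the exact numbers depend on the base model and the r / target_modules chosen above.
model.print_trainable_parameters()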
Next, let us define a prompt template that follows the Alpaca format. This format is widely used for fine-tuning models on instruction-following tasks.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
alpaca_prompt: This variable holds a string that is used as a template for creating prompt data. It combines an instruction, an input, and a response to fine-tune or evaluate a language model.
Instruction: The section marked ### Instruction: is where you input the task or directive for the model. It describes what the model should do.
Input: The ### Input: section provides additional context or information to help the model complete the task.
Response: The ### Response: section is where the model's output would go. It represents the completion of the task based on the instruction and input.
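To make the template concrete, here is how it looks once filled in; the instruction, input, and response values below are made-up examples purely for illustration:

# Fill the template with a hypothetical instruction / input / response triple
example = alpaca_prompt.format(
    "Translate the following sentence to French.",   # instruction
    "Good morning, how are you?",                    # input
    "Bonjour, comment allez-vous ?"                  # response (the target output during training)
)
print(example)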
Data Preprocessing
Let us write a Python function to format a batch of examples into the structure required by the Alpaca-style prompt template defined earlier. It takes examples, a dictionary containing lists of instructions, inputs, and outputs, and returns the corresponding formatted texts.
def format_input_prompt(examples):
    # get the list with keys
    instructions = examples['instruction']
    inputs = examples['input']
    outputs = examples['output']
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # format the input prompt
        text = alpaca_prompt.format(instruction, input, output)
        texts.append(text)
    return {"text": texts}
Access the Keys in the examples Dictionary
'instruction': A list of tasks or commands.
'input': A list of inputs providing further context to the instruction.
'output': A list of expected responses that the model should generate.
These lists are extracted into the variables instructions, inputs, and outputs.
Initialize an Empty List to Hold the Formatted Prompts.
The zip() function is used to loop over the corresponding values in the instructions, inputs, and outputs lists simultaneously.
For each combination of instruction, input, and output, the function applies the alpaca_prompt template using format().
This generates a single string formatted like the Alpaca-style prompt and appends it to the texts list.
The function returns a dictionary where the key is "text", and the value is the list of formatted Alpaca prompts. This structure can be useful for further processing, like preparing data for model training.
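To see what the function produces, you can call it on a small hand-made batch in the same column format as the real dataset; the values below are hypothetical and only for illustration:

# Hypothetical mini-batch with one example per column
sample_batch = {
    "instruction": ["Summarize the text in one sentence."],
    "input": ["The quick brown fox jumps over the lazy dog."],
    "output": ["A fox jumps over a dog."]
}
# Reuses the format_input_prompt function and alpaca_prompt template defined above
print(format_input_prompt(sample_batch)["text"][0])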
Next we will import the dataset.
# import the dataset
from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split='train')
dataset = dataset.map(format_input_prompt, batched=True)
The datasets library from Hugging Face is used to load datasets. This library makes it easy to access and manipulate various datasets for machine learning tasks.
The dataset "yahma/alpaca-cleaned" is being loaded from Hugging Face’s dataset repository.
The split='train' argument specifies that the training split of the dataset is being loaded.
This dataset contains instruction, input, and output columns, which match the Alpaca fine-tuning format used to teach language models to follow instructions.
The map() function is used to apply format_input_prompt to the entire dataset. It transforms each example by formatting the instruction, input, and output using the Alpaca prompt template.
The batched=True argument indicates that the function will be applied in batches, which means that the format_input_prompt function will process multiple examples at once (instead of one by one). This can improve efficiency, especially when working with large datasets.
Next, let us check what we have in the dataset.
dataset
dataset[0]
This will display the first example from the transformed dataset. After applying the format_input_prompt function, the output will contain the Alpaca-style formatted text in the "text" field.
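To view just the formatted prompt string (rather than the whole example dictionary), you can print the "text" field directly:

# Print the Alpaca-formatted prompt of the first training example
print(dataset[0]["text"])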
Trainer Setup
Next we will set up the trainer used to fine-tune the model.
from trl import SFTTrainer
from transformers import TrainingArguments
import torch  # needed below for the bf16 capability check (may already be imported from the setup steps)

trainer = SFTTrainer(
    model = model,  # the PEFT (LoRA) model prepared earlier
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 30,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 1234,
        output_dir = "outputs"
    )
)
SFTTrainer: This is the main trainer for supervised fine-tuning a language model. It allows you to fine-tune the model with minimal boilerplate, while handling various training dynamics like gradient accumulation, mixed precision, and more.
model = model: The model being trained is the PEFT model (Parameter-Efficient Fine-Tuning model) that has already been set up using LoRA in the previous steps.
train_dataset = dataset: The training dataset is the one that was loaded and formatted using the alpaca_prompt. It contains Alpaca-style prompts in the "text" field.
dataset_text_field = "text": This specifies the field in the dataset where the model will retrieve the text for training. In this case, it is the "text" field, which contains the formatted Alpaca prompts.
max_seq_length = max_seq_length: This defines the maximum sequence length the model will handle during training. It ensures that the input text does not exceed this length, which can be critical for memory management and training efficiency.
The TrainingArguments are crucial for defining the training configuration. Here's a breakdown of each argument:
per_device_train_batch_size=2: This sets the batch size per device (e.g., GPU). Here, it’s set to 2, meaning each GPU will handle two samples at a time.
gradient_accumulation_steps=4: Since the per-device batch size is small, gradients are accumulated over 4 steps before the optimizer updates the weights. With a batch size of 2 per device, this gives an effective batch size of 2 × 4 = 8 samples per update, simulating a larger batch without running out of memory.
warmup_steps=10: The number of warmup steps for the learning rate scheduler. During these steps, the learning rate will gradually increase from 0 to the set value (2e-4 in this case).
max_steps=30: The total number of training steps. The model will train for 30 steps.
learning_rate=2e-4: The learning rate for the optimizer. A rate of 2e-4 (0.0002) is set for fine-tuning, which is typical for transformer models.
fp16=not torch.cuda.is_bf16_supported() and bf16=torch.cuda.is_bf16_supported(): This ensures that the training uses mixed precision. If BF16 (bfloat16) is supported by the GPU, it will use that format, otherwise it will fall back to FP16 (float16). These settings help speed up training while saving memory on supported hardware.
logging_steps=1: The model will log metrics like loss after every training step. This can be useful for tracking progress, especially for short training runs.
optim="adamw_8bit": The optimizer is set to AdamW (8-bit), which helps to reduce memory usage by using 8-bit quantization. It’s efficient for large models and commonly used in fine-tuning.
weight_decay=0.01: Regularization is applied via weight decay to prevent overfitting. A value of 0.01 is typical for fine-tuning.
lr_scheduler_type="linear": A linear learning rate scheduler is used, which decreases the learning rate linearly after the warmup period.
seed=1234: This ensures reproducibility by setting a seed for random number generation. All aspects of training will be consistent across different runs if the same seed is used.
output_dir="outputs": The directory where all training artifacts (like model checkpoints and logs) will be saved.
This setup fine-tunes a PEFT-based language model using a highly efficient approach with mixed precision (FP16 or BF16), gradient accumulation, and 8-bit optimizers. The trainer will log progress at each step and save the model to the "outputs" directory after 30 steps.
Next we will initiate the training process using the configured SFTTrainer and the provided dataset, logging training progress and collecting statistics such as the loss and other metrics.
trainer_stats = trainer.train()
Once the training is complete, the statistics collected during training are stored in the trainer_stats variable. This typically includes information such as:
Training loss: The average training loss over the run.
Global step: The number of optimization steps completed.
Runtime and throughput: How long training took and how many samples and steps were processed per second.
Epoch: The fraction of the dataset covered during the run.
Next let us display trainer statistics.
trainer_stats
TrainOutput(global_step=30, training_loss=2.679690368970235, metrics={'train_runtime': 226.5245, 'train_samples_per_second': 1.059, 'train_steps_per_second': 0.132, 'total_flos': 2750561593786368.0, 'train_loss': 2.679690368970235, 'epoch': 0.00463678516228748})
When you run trainer_stats after calling trainer.train(), it outputs a TrainOutput object containing the statistics from the training run; the exact contents may vary depending on the version and configuration of the SFTTrainer you're using. The epoch value is tiny because the run covers only 30 steps with an effective batch size of 8, i.e. roughly 240 examples out of the ~52,000 in alpaca-cleaned.
Save the Model
Next we will save the trained model.
## save the model
model.save_pretrained("./best_model")
tokenizer.save_pretrained("./best_model")
This saves the model's architecture, configuration, and weights to the "./best_model" directory. The model can be later loaded from this directory using from_pretrained().
The tokenizer is responsible for converting raw text into tokens and must be saved along with the model. This ensures consistency between the input tokenization during training and inference.
The tokenizer’s configuration and vocab files will be saved in the same directory.
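Later, the saved LoRA adapter and tokenizer can be loaded back much like the base model was loaded. A minimal sketch, assuming Unsloth's FastLanguageModel.from_pretrained accepts the local directory created above and that the base model was loaded in 4-bit:

# Reload the fine-tuned model and tokenizer from the saved directory
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "./best_model",      # directory used in save_pretrained above
    max_seq_length = max_seq_length,
    load_in_4bit = True               # assumption: the base model was loaded in 4-bit
)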
Next we will save the model using Unsloth.
## unsloth save model
from unsloth import unsloth_save_model
unsloth_save_model(model, tokenizer, "unsloth_model")
model: The fine-tuned model you want to save.
tokenizer: The tokenizer that should be used with the model.
"unsloth_model": The directory where the model and tokenizer will be saved.
unsloth_save_model: A helper provided by the Unsloth library for saving models and tokenizers, which can be convenient for deployment scenarios.
The function call unsloth_save_model(model, tokenizer, "unsloth_model") will save both the model and tokenizer to the specified directory, "unsloth_model". This directory will contain all necessary files to reload the model and tokenizer later.
Trained Model Inference
Next we will demonstrate how to use a fine-tuned model for inference with a custom prompt.
FastLanguageModel.for_inference(model)
instruction = "You are a helpful assistant who can answer questions"
input = "Who developed GPT models"
# process the input
inputs = tokenizer([alpaca_prompt.format(instruction, input, "")], return_tensors='pt').to('cuda')
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.batch_decode(outputs)[0]
print(response)
FastLanguageModel.for_inference(model): This method configures the model for inference, switching off training-only behaviour (such as gradient checkpointing) and enabling Unsloth's faster generation path.
Define the Instruction and Input: instruction and input variables define the context and query that will be used to generate a response. The instruction sets the context for the model, and the input is the question or prompt for which you want a response.
Format the Input Prompt:
The alpaca_prompt.format(instruction, input, "") formats the instruction and input into the Alpaca-style prompt.
tokenizer([formatted_prompt], return_tensors='pt') tokenizes the formatted prompt and prepares it as PyTorch tensors.
.to('cuda') moves the tensors to the GPU for faster processing.
Generate the Response: model.generate(**inputs, max_new_tokens=100) generates a response based on the formatted input. The max_new_tokens parameter limits the length of the generated response to 100 tokens.
Decode and Print the Response:
tokenizer.batch_decode(outputs) decodes the generated token IDs back into human-readable text.
[0] extracts the first (and only) response from the batch.
print(response) outputs the final generated response.
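If you would rather see tokens appear as they are generated instead of waiting for the full output, the transformers library provides a TextStreamer that can be passed to generate(). A minimal sketch reusing the inputs prepared above:

# Stream generated tokens to stdout as they are produced
from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 100)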
Final Thoughts
Fine-tuning a large language model (LLM) like LLaMA 3 in Google Colab offers an accessible and powerful way to tailor models to specific tasks or domains. Throughout this guide, we’ve covered the essential steps required to set up, train, and deploy an LLM using Colab.
Fine-tuning LLMs can be resource-intensive. Google Colab provides a convenient environment, but understanding and managing GPU memory and computational limits is essential.
Fine-tuning involves a lot of experimentation with hyperparameters and training configurations. Iterating based on results and metrics will lead to better performance.
Leverage the documentation of the libraries and tools used. Communities and forums can also provide valuable support and insights.
By following these steps, you can effectively fine-tune a large language model like LLaMA 3 to suit your specific needs. The process involves careful preparation, data handling, and experimentation, but the results can significantly enhance the model’s performance on specialized tasks. Happy fine-tuning!
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm