
How to run Llama 3 (or) any LLM in Colab | Unsloth

In recent years, the evolution of language models has revolutionized the landscape of natural language processing (NLP), enabling applications ranging from chatbots to automated content generation. Running powerful models such as Llama 3 within the accessible environment of Google Colab using Unsloth opens up a world of possibilities for developers and researchers alike.

How to use Llama in Colab

This guide aims to demystify the process of using advanced language models in Colab, providing step-by-step instructions and practical examples. Whether you're making your first foray into NLP experimentation or seeking to streamline your workflow with the latest model releases, this article will equip you with the foundational knowledge and hands-on skills needed to get started effectively.



You can watch the video-based tutorial with a step-by-step explanation down below.


Install Dependencies


First, we will install the Python packages needed for working with transformers, fine-tuning language models, and optimizing performance in a Google Colab environment; a quick GPU check is shown after the package descriptions below.

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" -q
!pip install --no-deps "trl<0.9.0" peft accelerate bitsandbytes xformers datasets -q
  • unsloth: A library for fast fine-tuning and inference of large language models such as Llama. The command installs it directly from its GitHub repository.

  • [colab-new]: An optional extras group that installs the set of dependencies tailored for Google Colab.

  • -q: The -q flag tells pip to run quietly, minimizing the output.

  • --no-deps: This flag tells pip to install only the listed packages without their dependencies, so the environment set up in the previous step is not overwritten.

  • trl<0.9.0: The trl library (Transformer Reinforcement Learning) from Hugging Face is used for fine-tuning transformer models; a version below 0.9.0 is pinned here.

  • peft: This is a library for Parameter-Efficient Fine-Tuning (PEFT) of large language models.

  • accelerate: A library by Hugging Face to make training of large models faster and easier, especially in distributed settings.

  • bitsandbytes: A library that enables 8-bit optimizers, which can drastically reduce memory usage and improve computational efficiency.

  • xformers: A library to improve the efficiency of transformers, particularly with faster attention mechanisms.
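
Since loading models in 4-bit with bitsandbytes (and the xformers attention kernels) requires a CUDA GPU, it is worth confirming that the Colab runtime actually has one before going further. Here is a minimal check, assuming you have switched the Colab runtime type to GPU (e.g. a T4):

import torch

# Confirm a CUDA-capable GPU is visible to PyTorch; without one,
# 4-bit loading via bitsandbytes will not work.
if torch.cuda.is_available():
    print("GPU available:", torch.cuda.get_device_name(0))
else:
    print("No GPU detected - change the Colab runtime type to GPU.")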



LLM Inference


Next we will demonstrate how to load and prepare a language model using the FastLanguageModel class from the unsloth library.

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048   # maximum number of input tokens the model will handle
dtype = None            # None lets unsloth choose a suitable dtype automatically
load_in_4bit = True     # load weights in 4-bit precision to save GPU memory

# load the pre-quantized Llama 3 8B model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)

# switch the model into optimized inference mode
FastLanguageModel.for_inference(model)
Llama model details
  • Importing Necessary Modules

    • FastLanguageModel: A class from the unsloth library that is likely designed to simplify the loading and use of large language models.

    • torch: The PyTorch library, which is a common framework for deep learning, is imported for handling tensor computations and model manipulation.

  • Setting Configuration Parameters

    • max_seq_length = 2048: This sets the maximum sequence length for the model's input. In this case, it is set to 2048 tokens, meaning the model can process input sequences up to this length.

    • dtype = None: This variable likely specifies the data type for model weights (e.g., torch.float32, torch.float16), but it's set to None here, meaning the default will be used.

    • load_in_4bit = True: This indicates that the model should be loaded in 4-bit precision, which helps in reducing memory usage, making it more feasible to run large models on limited hardware such as Google Colab (a quick memory check follows this list).

  • Loading the Model and Tokenizer

    • from_pretrained(): This method loads a pre-trained model and its tokenizer.

    • model_name = "unsloth/llama-3-8b-bnb-4bit": Specifies the model to be loaded, in this case, a version of LLaMA 3 with 8 billion parameters, likely optimized for 4-bit precision.

    • max_seq_length, dtype, and load_in_4bit: These are passed as arguments to configure the model's behavior during loading.

  • Preparing the Model for Inference

    • for_inference(model): This function likely adjusts the model settings to make it ready for inference, optimizing it for generating predictions based on the given input.
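
To see the effect of 4-bit loading in practice, you can check how much memory the loaded model occupies. This is an optional sketch using standard PyTorch and Hugging Face Transformers calls (get_memory_footprint and torch.cuda.memory_allocated), not part of the original tutorial code:

import torch

# Approximate size of the loaded (4-bit quantized) model weights
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

# Total GPU memory currently allocated by PyTorch
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")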



Next, we will create a prompt template that provides a clear and structured way to interact with the model, ensuring that its responses are aligned with the given task and context.

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
  • The alpaca_prompt string is a template for generating prompts in a structured format, commonly used with models like Alpaca or similar instruction-following language models. The template includes placeholders for the instruction, input, and response, allowing you to dynamically insert these components into the prompt (a filled-in example follows this list).

  • Components

    1. Instruction:

      • This section describes the task that you want the model to perform. It guides the model on what kind of response is expected.

    2. Input:

      • This part provides additional context or specific information that the model needs to complete the task described in the instruction. It can be optional, depending on the task.

    3. Response:

      • This is where the model's generated output will be inserted after processing the instruction and input.
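
To make the template concrete, here is a small illustrative example (the instruction and input are placeholders of our own) showing how str.format fills the three slots; the response slot is left empty so the model can complete it:

# fill the template with an example instruction and input;
# the response slot stays empty for the model to generate
example_prompt = alpaca_prompt.format(
    "Summarize the following sentence in five words.",              # instruction
    "Large language models can now run in a free Colab notebook.",  # input
    ""                                                               # response
)
print(example_prompt)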



Next we will demonstrate how to generate a response from a language model using a specific instruction and input, leveraging the alpaca_prompt template for formatting.

instruction = "You are a helpful assistant who can answer questions"
input = "Who developed GPT models"

# process the input
inputs = tokenizer([alpaca_prompt.format(instruction, input, "")], return_tensors='pt').to('cuda')
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.batch_decode(outputs)[0]
print(response)
Response to the given prompt
  • instruction: This tells the model what its role is and what it should do—in this case, act as a helpful assistant who answers questions.

  • input: The specific question or task you want the model to address, here asking about who developed GPT models.

  • alpaca_prompt.format(instruction, input, ""): The alpaca_prompt template is formatted with the instruction and input. The response part is left empty as it will be generated by the model.

  • tokenizer(): The tokenizer converts the formatted prompt into input tensors that the model can process. The return_tensors='pt' argument indicates that the output should be in PyTorch tensor format.

  • .to('cuda'): This moves the tensor to the GPU for faster processing, assuming you are running this in a Colab environment with GPU enabled.

  • model.generate(**inputs, max_new_tokens=100): The model generates a response based on the inputs. The max_new_tokens=100 argument limits the response to 100 new tokens.

  • tokenizer.batch_decode(outputs)[0]: The output from the model is decoded back into human-readable text. batch_decode is used because the model's output is typically a batch of tokens. The [0] index is used to get the first (and only) response from the batch.

  • print(response): This prints the generated response, which includes the full prompt followed by the model's answer (see the sketch after this list for extracting only the answer).
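
If you only want the generated answer rather than the echoed prompt and special tokens, one simple option, assuming the alpaca_prompt template above, is to decode without special tokens and split on the "### Response:" marker:

# decode without special tokens and keep only the text after "### Response:"
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
answer = decoded.split("### Response:")[-1].strip()
print(answer)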



Next, we will provide another input and generate a response, this time passing a temperature value (note that temperature only affects the output when sampling is enabled with do_sample=True); a streaming variation is shown after the output below.

instruction = "You are a helpful assistant who can answer questions"
input = "Explain about Transformers in AI?"

# process the input
inputs = tokenizer([alpaca_prompt.format(instruction, input, "")], return_tensors='pt').to('cuda')
outputs = model.generate(**inputs, max_new_tokens=100, temperature = 0.1)
response = tokenizer.batch_decode(outputs)[0]
print(response)
Response for the given prompt
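
If you prefer to see tokens appear as they are generated rather than waiting for the full output, the TextStreamer class from transformers can be passed to generate. This is an optional variation on the call above, not part of the original tutorial:

from transformers import TextStreamer

# stream tokens to stdout as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=100)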


Final Thoughts

  • Google Colab offers a free and convenient environment to experiment with LLMs without the need for expensive hardware or extensive setup. With GPU support, it allows users to run large models efficiently, making advanced NLP research and development more accessible.

  • Working with LLMs like LLaMA 3 requires careful optimization, especially in resource-constrained environments like Colab. Techniques such as loading models in 4-bit precision, using libraries like bitsandbytes for memory efficiency, and fine-tuning parameters (e.g., max_seq_length, temperature) are essential for maximizing performance and minimizing costs.

  • Despite its advantages, there are challenges to consider. Colab’s free tier has limitations on session duration, memory, and compute resources, which can be restrictive when working with very large models or datasets. Understanding and working within these constraints is crucial for effective use.

  • Using LLMs in Google Colab is a blend of power and practicality, offering an accessible way to leverage some of the most advanced tools in AI today. By understanding the platform's strengths and limitations, users can unlock a wide range of applications, from research to real-world deployment, making it an invaluable resource in the modern AI toolkit.


In this tutorial, we have explored the process of using LLaMA in Google Colab, providing a step-by-step guide to effectively harness the power of this advanced language model within a flexible and accessible environment. As we move forward, future tutorials will delve deeper into the intricacies of fine-tuning LLaMA in Colab, enabling you to customize the model to better suit specific tasks and applications. Stay tuned for these upcoming insights, where we will further enhance your understanding and capabilities in working with large language models in Google Colab.



Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
