Running Takeoff with LoRAs
This guide covers how to start Takeoff with Low-Rank Adapter (LoRA) modules attached to a model. LoRAs are a popular way to fine-tune LLMs at low cost while performing comparably to the much more resource-intensive full fine-tuning.
LoRA training works by freezing the original weights of a model and training only small adapters that are added on top. These adapters are much smaller than the original model, commonly between 1% and 5% of its size.
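As a rough illustration of why adapters are so small, here is a minimal sketch in plain NumPy, using made-up dimensions for a single weight matrix: the LoRA trains two low-rank matrices in place of the frozen weight, and the resulting parameter ratio is only a few percent per matrix.
import numpy as np

# Hypothetical dimensions for a single frozen weight matrix in the base model.
d, k = 4096, 4096   # base weight shape
r = 64              # LoRA rank, far smaller than d or k
alpha = 16          # LoRA scaling factor

W = np.zeros((d, k))               # frozen base weight (not trained)
A = np.random.randn(r, k) * 0.01   # trainable LoRA matrix A
B = np.zeros((d, r))               # trainable LoRA matrix B, initialised to zero

# At inference time the effective weight is the frozen weight plus the
# scaled low-rank update contributed by the adapter.
W_effective = W + (alpha / r) * (B @ A)

# Per matrix, the adapter holds only a small fraction of the base parameters.
print(f"adapter / base parameters: {(A.size + B.size) / W.size:.2%}")   # ~3.12%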
You can load many LoRAs simultaneously with a single model and specify in each request which LoRA you want to interact with.
Launching Takeoff with LoRAs
To launch a container that attaches LoRAs to a generative model, you must add an environment parameter called TAKEOFF_LORAS. This is a comma-separated list of Hugging Face LoRA names, passed in as a single string.
docker run \
-e TAKEOFF_MODEL_NAME=meta-llama/Llama-3.2-3B \
-e TAKEOFF_DEVICE=cuda \
-e TAKEOFF_LORAS=hf-repo/your-lora-1,hf-repo/your-lora-2,hf-repo/your-lora-3 \
-p 3000:3000 \
-v ~/.takeoff_cache:/code/models \
-it \
--gpus all \
tytn/takeoff-pro:0.21.2-gpu
The above command will attach the three specified LoRAs to the base model.
If you are using a manifest file, you can pass in the LoRAs in the following way:
takeoff:
  server_config:
  readers_config:
    reader1:
      model_name: "meta-llama/Llama-3.2-3B"
      loras: "hf-repo/your-lora-1,hf-repo/your-lora-2,hf-repo/your-lora-3"
      device: "cuda"
      consumer_group: "generate"
Restrictions on LoRAs
To use a LoRA with a model, the base model of the LoRA must match the model specified in the TAKEOFF_MODEL_NAME parameter. You can check that the base models match by looking in the adapter_config.json file of the LoRA: the base_model_name_or_path field must match the model name. Other than that, the LoRAs may apply to different weights within the model and can have different values of r and alpha, and they can still be inferenced in parallel.
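If you want to check this programmatically before launching the container, a small sketch along the following lines (using the huggingface_hub library; the model and LoRA names are the placeholders from the example above) compares each adapter's recorded base model against the model you intend to load:
import json
from huggingface_hub import hf_hub_download

base_model = "meta-llama/Llama-3.2-3B"
loras = ["hf-repo/your-lora-1", "hf-repo/your-lora-2", "hf-repo/your-lora-3"]

for lora in loras:
    # Download only the adapter_config.json from each LoRA repository.
    config_path = hf_hub_download(repo_id=lora, filename="adapter_config.json")
    with open(config_path) as f:
        adapter_config = json.load(f)

    recorded_base = adapter_config.get("base_model_name_or_path")
    status = "OK" if recorded_base == base_model else "MISMATCH"
    print(f"{lora}: base_model_name_or_path={recorded_base!r} [{status}]")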
Interacting with LoRAs
When you make a generation request, you can specify that you want to interact with a particular LoRA attached to that model. You do that by specifying the lora_id parameter in the generation request.
Here is an example request that targets a specific LoRA:
curl -X POST \
"http://localhost:3000/generate_stream" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d "{
\"text\":\"List 3 things to do in London.\",
\"sampling_temperature\":0.1,
\"lora_id\":\"hf-repo/your-lora-1\"
}"
This will use the LoRA with the name hf-repo/your-lora-1. Specifying no LoRA will send the request to the base model, with no LoRAs attached.
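The same call can be made from Python. Below is a minimal sketch using the requests library; the payload mirrors the curl example above, and the streamed chunks are printed as they arrive (the exact framing of the streamed response depends on the Takeoff server):
import requests

payload = {
    "text": "List 3 things to do in London.",
    "sampling_temperature": 0.1,
    "lora_id": "hf-repo/your-lora-1",   # omit this field to target the base model
}

# Stream the generation back from the Takeoff server as it is produced.
with requests.post("http://localhost:3000/generate_stream", json=payload, stream=True) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)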