Running Takeoff with LoRAs
This guide covers how to start Takeoff with Low-Rank Adapter (LoRA) modules attached to a model. LoRAs are a popular way to fine-tune LLMs at low cost while performing comparably to the much more resource-intensive full fine-tuning.
LoRA training works by freezing the original weights of a model and training only small adapters that are added on top. These adapters are much smaller than the original model, commonly between 1% and 5% of its size.
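As a rough illustration of why adapters are so small, here is a minimal sketch in plain NumPy, using made-up dimensions for a single weight matrix: the LoRA trains two low-rank matrices in place of the frozen weight, and the resulting parameter ratio is only a few percent per matrix.
import numpy as np

# Hypothetical dimensions for a single frozen weight matrix in the base model.
d, k = 4096, 4096   # base weight shape
r = 64              # LoRA rank, far smaller than d or k
alpha = 16          # LoRA scaling factor

W = np.zeros((d, k))               # frozen base weight (not trained)
A = np.random.randn(r, k) * 0.01   # trainable LoRA matrix A
B = np.zeros((d, r))               # trainable LoRA matrix B, initialised to zero

# At inference time the effective weight is the frozen weight plus the
# scaled low-rank update contributed by the adapter.
W_effective = W + (alpha / r) * (B @ A)

# Per matrix, the adapter holds only a small fraction of the base parameters.
print(f"adapter / base parameters: {(A.size + B.size) / W.size:.2%}")   # ~3.12%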
You can load many LoRAs simultaneously with a single model and specify in each request which LoRA you want to interact with.
Launching Takeoff with LoRAs
To launch a container that attaches LoRAs to a generative model, you must add an environment parameter called TAKEOFF_LORAS. This is a comma-separated list of Hugging Face LoRA names, passed in as a single string.
docker run \
-e TAKEOFF_MODEL_NAME=meta-llama/Llama-3.2-3B \
-e TAKEOFF_DEVICE=cuda \
-e TAKEOFF_LORAS=hf-repo/your-lora-1,hf-repo/your-lora-2,hf-repo/your-lora-3 \
-p 3000:3000 \
-v ~/.takeoff_cache:/code/models \
-it \
--gpus all \
tytn/takeoff-pro:0.21.2-gpu
The above command will attach the three specified LoRAs to the base model.
If you are using a manifest file, you can pass in the LoRAs in the following way:
takeoff:
  server_config:
  readers_config:
    reader1:
      model_name: "meta-llama/Llama-3.2-3B"
      loras: "hf-repo/your-lora-1,hf-repo/your-lora-2,hf-repo/your-lora-3"
      device: "cuda"
      consumer_group: "generate"
Restrictions on LoRAs
To use a LoRA with a model, the base model of the LoRA must match the model specified in the TAKEOFF_MODEL_NAME parameter. You can check that the base models match by looking in the adapter_config.json file of the LoRA: the base_model_name_or_path field must match the model name. Other than that, the LoRAs may apply to different weights within the model and can have different values of r and alpha, and they can still be inferenced in parallel.
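If you want to check this programmatically before launching the container, a small sketch along the following lines (using the huggingface_hub library; the model and LoRA names are the placeholders from the example above) compares each adapter's recorded base model against the model you intend to load:
import json
from huggingface_hub import hf_hub_download

base_model = "meta-llama/Llama-3.2-3B"
loras = ["hf-repo/your-lora-1", "hf-repo/your-lora-2", "hf-repo/your-lora-3"]

for lora in loras:
    # Download only the adapter_config.json from each LoRA repository.
    config_path = hf_hub_download(repo_id=lora, filename="adapter_config.json")
    with open(config_path) as f:
        adapter_config = json.load(f)

    recorded_base = adapter_config.get("base_model_name_or_path")
    status = "OK" if recorded_base == base_model else "MISMATCH"
    print(f"{lora}: base_model_name_or_path={recorded_base!r} [{status}]")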
Interacting with LoRAs
When you make a generation request, you can specify that you want to interact with a particular LoRA attached to that model. You do that by specifying the lora_id parameter in the generation request.
Here is an example request that targets a specific LoRA:
curl -X POST \
"http://localhost:3000/generate_stream" \
-H "accept: application/json" \
-H "Content-Type: application/json" \
-d "{
\"text\":\"List 3 things to do in London.\",
\"sampling_temperature\":0.1,
\"lora_id\":\"hf-repo/your-lora-1\"
}"
This will use the LoRA with the name hf-repo/your-lora-1. Specifying no LoRA will send the request to the base model, with no LoRAs attached.
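The same call can be made from Python. Below is a minimal sketch using the requests library; the payload mirrors the curl example above, and the streamed chunks are printed as they arrive (the exact framing of the streamed response depends on the Takeoff server):
import requests

payload = {
    "text": "List 3 things to do in London.",
    "sampling_temperature": 0.1,
    "lora_id": "hf-repo/your-lora-1",   # omit this field to target the base model
}

# Stream the generation back from the Takeoff server as it is produced.
with requests.post("http://localhost:3000/generate_stream", json=payload, stream=True) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)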