Version: 0.12.x

Model Management via API

Takeoff comes with a Management API for dynamically managing readers and consumer groups. This is hosted at the port specified by TAKEOFF_MANAGEMENT_PORT, which defaults to 3001, with docs at e.g. localhost:3001/docs.

info

As the management API can be used to remove or manipulate readers, you should consider restricting access to its port to administrators only.

A full API spec is available here.

Viewing consumer groups

You can view the current configuration of Takeoff's consumer groups by sending a GET request to localhost:<port>/reader_groups. This will return a JSON object of the following form:

GET localhost:<port>/reader_groups
{
  // This model would have been loaded when you first spun up Takeoff
  "primary": [
    {
      "reader_id": "fast-t5",
      "model_name": "google/flan-t5-base",
      "backend": "hf",
      "model_type": "CAUSAL",
      "consumer_group": "primary",
      "pids": [12],
      "ready": true
    }
  ],
  // This is a consumer group you would have created by using the management API without having to reload the server!
  "llamas": [
    {
      "reader_id": "llama_1",
      "model_name": "TitanML/llama2-7b-base-4bit-AWQ",
      "backend": "awq",
      "model_type": "CAUSAL",
      "consumer_group": "llamas",
      "pids": [13]
    },
    {
      "reader_id": "llama_2",
      "model_name": "TitanML/llama2-7b-base-4bit-AWQ",
      "backend": "awq",
      "model_type": "CAUSAL",
      "consumer_group": "llamas",
      "pids": [14]
    }
  ]
}

You can view the configuration of a specific reader by sending a GET request to localhost:<port>/reader/<reader_id>. For example, if reader_id=llama_2, this will return a JSON object of the following form:

GET localhost:<port>/reader/<reader_id>
{
  "model_name": "TitanML/llama2-7b-base-4bit-AWQ",
  "backend": "awq",
  "model_type": "CAUSAL",
  "consumer_group": "llamas",
  "pids": [14]
}
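As a concrete sketch of the two requests above (assuming the default management port 3001, and the hypothetical llama_2 reader id from the example):

```shell
# List all consumer groups and their readers via the management API
# (3001 is the default TAKEOFF_MANAGEMENT_PORT; adjust if you changed it)
curl -s -X GET "http://localhost:3001/reader_groups" -H "accept: application/json"

# Inspect a single reader by its reader_id
READER_ID="llama_2"   # hypothetical id from the example above
echo "Requesting http://localhost:3001/reader/$READER_ID"
curl -s -X GET "http://localhost:3001/reader/$READER_ID" -H "accept: application/json"
```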

Adding a New Reader

To add a new reader, send a POST request to localhost:<port>/reader with the reader configuration as the HTTP request body. The reader configuration is a JSON object containing ReaderConfig fields.
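For instance, the sketch below adds a reader to a new llamas consumer group. The specific field values are illustrative assumptions; consult the ReaderConfig schema in the API spec for the full set of accepted fields:

```shell
# Build the reader configuration as JSON. These values (model, device,
# consumer group) are illustrative; see the ReaderConfig schema for options.
READER_CONFIG='{"model_name":"TitanML/llama2-7b-base-4bit-AWQ","device":"cuda","consumer_group":"llamas"}'
echo "$READER_CONFIG"

# POST it to the management API (default port 3001)
curl -X POST "http://localhost:3001/reader" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d "$READER_CONFIG"
```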

Removing a Reader

To remove a reader, send a DELETE request to localhost:<port>/reader/<reader_id>. This removes the reader from its consumer group and kills it. If it is the last reader in the consumer group, the consumer group is deleted as well.
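For example, to remove the hypothetical llama_2 reader from the earlier example (assuming the default management port 3001):

```shell
# Remove a reader by its reader_id; if it is the last reader in its
# consumer group, the group is deleted too.
READER_ID="llama_2"   # hypothetical id from the examples above
curl -X DELETE "http://localhost:3001/reader/$READER_ID" -H "accept: application/json"
```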

Processing Parallelisation

Spinning up multiple readers has benefits over launching multiple Takeoff servers using an orchestrator like Kubernetes.

For example, spinning up multiple readers with the same configuration and adding them to the same consumer group lets you process more requests in parallel. If you had a machine with 4xA10 GPUs, you could spin up 4 readers, each serving a copy of a Llama model:

#!/bin/bash

# Spin up Takeoff initially with the server config and the first reader
docker run --gpus all --shm-size 2G \
  -e TAKEOFF_MODEL_NAME="TitanML/llama2-7b-base-4bit-AWQ" \
  -e TAKEOFF_ACCESS_TOKEN=$TAKEOFF_ACCESS_TOKEN \
  -e TAKEOFF_DEVICE="cuda" \
  -e TAKEOFF_CUDA_VISIBLE_DEVICES=0 \
  -v $HOME/.takeoff_cache:/code/models \
  -p 3000:3000 -p 3001:3001 \
  -it tytn/takeoff-pro:0.12.0-gpu

# Add 3 more readers to the primary consumer group to join the first reader
for n in 1 2 3
do
  curl -X POST "http://localhost:3001/reader" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{\"model_name\":\"TitanML/llama2-7b-base-4bit-AWQ\",\"device\":\"cuda\",\"consumer_group\":\"primary\",\"cuda_visible_devices\":\"$n\"}"
done

# Check the consumer group config
curl -X GET "http://localhost:3001/reader_groups" -H "accept: application/json"
note

You can also instantiate multiple readers at launch time through the use of a manifest file; see more here.

The response of the final curl command would then be:

{
  "primary": [
    {
      "reader_id": "llama_1",
      "model_name": "TitanML/llama2-7b-base-4bit-AWQ",
      "backend": "compress-fast",
      "model_type": "CAUSAL",
      "consumer_group": "primary",
      "pids": [12]
    },
    {
      "reader_id": "llama_2",
      "model_name": "TitanML/llama2-7b-base-4bit-AWQ",
      "backend": "compress-fast",
      "model_type": "CAUSAL",
      "consumer_group": "primary",
      "pids": [13]
    },
    {
      "reader_id": "llama_3",
      "model_name": "TitanML/llama2-7b-base-4bit-AWQ",
      "backend": "compress-fast",
      "model_type": "CAUSAL",
      "consumer_group": "primary",
      "pids": [14]
    },
    {
      "reader_id": "llama_4",
      "model_name": "TitanML/llama2-7b-base-4bit-AWQ",
      "backend": "compress-fast",
      "model_type": "CAUSAL",
      "consumer_group": "primary",
      "pids": [15]
    }
  ]
}

Now we can send queries to the inference endpoints at localhost:3000/generate or localhost:3000/generate_stream. Requests will be sent to the primary consumer group (the default) and processed in parallel by the 4 readers, improving the server's throughput. Note that each GPU will hold its own copy of the model, so you will need to ensure each has sufficient memory. If you are struggling to fit your model onto your GPUs, consider using multi-gpu inference, which splits a single model across multiple GPUs.
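As a sketch of querying the server once the readers are up, a request like the following hits the generation endpoint; the "text" field in the payload and the prompt itself are assumptions here, so check the inference API docs for the exact request schema your version expects:

```shell
# Hypothetical prompt payload; requests to /generate are routed to the
# "primary" consumer group by default, so all 4 readers share the load.
PROMPT='{"text": "List three uses of a llama."}'
echo "$PROMPT"

# Send to the inference API (port 3000 by default)
curl -X POST "http://localhost:3000/generate" \
  -H "Content-Type: application/json" \
  -d "$PROMPT"
```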