
Running multiple models at once


This guide will cover how to run more than one model at once using Takeoff. We'll cover:

- A quick introduction to consumer groups
- Deploying one model with a manifest file
- Deploying multiple models with a manifest file
- Changing which models are running after launch


A quick introduction to Consumer Groups

Deploying multiple models with Takeoff requires the use of consumer groups, which are essentially groups of readers. Every reader is part of a consumer group, and there can be multiple readers per consumer group.

Consumer groups distinguish model functionality. For example, you could have two consumer groups each with one reader in them, with one consumer group having a very large language model and the other having a smaller language model in it. You could then have an application route easier tasks to the smaller model, and harder ones to the larger model. You would use the consumer group to specify which readers should receive the request. Similarly, you could have one consumer group have an embedding model and another a generative model, with a RAG application sending requests to each of them at the relevant points.

There are also cases where you'd want multiple, homogeneous readers in a single consumer group, but these are only useful in distributed contexts which we won't cover here.
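To make this routing concrete, here's a rough sketch of what sending requests to different consumer groups could look like once Takeoff is running. The group names "small-llm" and "large-llm" are hypothetical, and the /generate endpoint and request fields are assumptions here - check the inference API reference for the exact schema.

Example: routing requests by consumer group (sketch)
# Easy task -> consumer group containing the smaller model
# (group names are hypothetical; endpoint and fields are assumptions)
curl -X POST 'http://localhost:3000/generate' \
-H 'Content-Type: application/json' \
-d '{"text": "Classify the sentiment of: great service!", "consumer_group": "small-llm"}'

# Harder task -> consumer group containing the larger model
curl -X POST 'http://localhost:3000/generate' \
-H 'Content-Type: application/json' \
-d '{"text": "Draft a detailed migration plan for our database.", "consumer_group": "large-llm"}'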

If you want to configure multiple readers at runtime, you need to use a manifest file.

Deploying one model with a manifest file

To help illustrate how manifest files work we'll revisit our example from the guide on launching a single model in a Docker container, but instead launch it using a manifest file. The first step is the same - you should ensure you have access to the Takeoff Docker repository and have Docker and the NVIDIA Container Toolkit installed.

Using docker run and environment variables, we had:

Example launch without using a manifest
docker run \
-e TAKEOFF_MODEL_NAME=meta-llama/Llama-3.2-1B \
-e TAKEOFF_DEVICE=cuda \
-e LICENSE_KEY=[your_license_key] \
-e TAKEOFF_ACCESS_TOKEN=123 \
-p 3000:3000 \
-v ~/.takeoff_cache:/code/models \
-it \
--gpus all \
tytn/takeoff-pro:0.21.2-gpu

We can extract our reader configuration from here and put it into a manifest file:

my_config.yaml
takeoff:
  server_config:
    access_token: 123
  readers_config:
    reader1:
      model_name: "meta-llama/Llama-3.2-1B"
      device: "cuda"
      consumer_group: "generate"

Notice how the reader we implicitly set up using environment variables now has to be explicitly defined and given a reader ID - I chose "reader1". We also need to assign the reader a consumer_group - I went for "generate" as it's a generative model. When we add an embedding model later we'll put it in an "embed" consumer group (these names are arbitrary, though). The reader is configured to use the same model we previously specified via the -e argument to docker run.

We can now launch Takeoff using this config file by mounting it to the relevant location with -v ./my_config.yaml:/code/config.yaml:

Example launch using a manifest
docker run --gpus all \
-p 3000:3000 \
-v ~/.takeoff_cache:/code/models \
-v ./my_config.yaml:/code/config.yaml \
-e LICENSE_KEY=[your_license_key] \
tytn/takeoff-pro:0.21.2-gpu

Note how we still have to pass in LICENSE_KEY via the -e argument to docker run. Variables which do not start with TAKEOFF_ are intended to be kept out of manifest files, and so can only be specified with -e to docker run. Also note that the location of the config file inside the container is fixed - if the manifest is not mounted at this location it will be ignored.

This should spawn a reader in the same way as the manifest-less example, with only two minor differences: here, the reader ID and consumer group are explicitly specified, whereas in the manifest-less case the reader ID is autogenerated and the consumer group defaults to "primary".

ASIDE: Modifying the server config.

In this guide we're focusing on configuring Takeoff's readers, but Takeoff also has a central server to communicate with them. The server has its own parameters that you can see here. These are mostly used for advanced cases so we won't go into more detail, but they can be specified through a mix of the manifest and environment variables, with anything specified via -e in the docker run command overriding the corresponding value in the manifest file. LICENSE_KEY, OFFLINE_MODE and NO_COLOR can only be specified as environment variables using -e to docker run (as can the standard Docker options like --gpus and --shm-size).
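For instance, because -e values take precedence over the manifest, you could keep a default access token in my_config.yaml and override it at launch. This sketch assumes the TAKEOFF_ACCESS_TOKEN environment variable maps to the manifest's access_token field, as in the examples above.

Example: overriding a manifest value with -e (sketch)
# TAKEOFF_ACCESS_TOKEN=456 overrides access_token: 123 from the manifest
docker run --gpus all \
-p 3000:3000 \
-v ~/.takeoff_cache:/code/models \
-v ./my_config.yaml:/code/config.yaml \
-e LICENSE_KEY=[your_license_key] \
-e TAKEOFF_ACCESS_TOKEN=456 \
tytn/takeoff-pro:0.21.2-gpu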

Deploying multiple models with a manifest file

Now that we've covered how to use manifests, adding more models is straightforward, as illustrated by this example, which launches two consumer groups - one for embedding and one for generation - with a single reader in each.

my_config.yaml
takeoff:
  server_config:

  readers_config:
    reader1:
      model_name: "intfloat/e5-small-v2"
      device: "cpu"
      consumer_group: "embed"
    reader2:
      model_name: "meta-llama/Llama-2-7b-chat-hf"
      device: "cuda"
      consumer_group: "generate"

As we did with a single model, we can launch Takeoff as follows:

Example launch with multiple models
docker run --gpus all \
-p 3000:3000 \
-p 3001:3001 \
-v ~/.takeoff_cache:/code/models \
-v ./my_config.yaml:/code/config.yaml \
-e LICENSE_KEY=[your_license_key] \
tytn/takeoff-pro:0.21.2-gpu

You'll notice we've also forwarded port 3001 here. This is the port for the Takeoff Management API, which is not required for using multiple models but is useful for checking that they've been launched properly. With 3001 forwarded we can now visit the management GUI at http://localhost:3001.

[Image: Model Management Home]

This lets us see the status of our multiple readers:

[Image: Readers]

You should be able to see your two consumer groups, and the model in each of them. The management GUI also lets you change which models are running after you've launched with the initial manifest.
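You can also check that the readers respond to requests directly. The sketch below sends an embedding request to the embed consumer group; the /embed endpoint path and request fields are assumptions, so check the inference API reference for the exact schema. Requests to the generate group work the same way via the generation endpoint.

Example: sending a request to the embed consumer group (sketch)
# Endpoint path and field names are assumptions - see the inference API reference
curl -X POST 'http://localhost:3000/embed' \
-H 'Content-Type: application/json' \
-d '{"text": "Takeoff consumer groups", "consumer_group": "embed"}'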

Changing which models are running after launch time

You can change which models are running either by using the GUI from our now-exposed Management frontend, or by using the Management API itself.

For example, using the GUI you can add a model to an existing consumer group or create a new consumer group entirely:

[Image: Reader Creation Page]

It's worth noting, however, that the Management API is required to access the full range of configuration options for a reader. We'll now add a reader using the API, but first let's check which readers are currently running by sending a GET request to localhost:3001/reader_groups.

You can generate the requests we're going to use, and see the entire API specification, in the API docs. You can send a request from the API docs themselves or generate a curl command such as this:

curl -L 'http://localhost:3001/reader_groups' \
-H 'Accept: application/json'

We'll then be expecting a response like this:

GET http://localhost:3001/reader_groups
{
  "embed": [
    {
      "reader_id": "reader1",
      "device": "cpu",
      "model_name": "intfloat/e5-small-v2",
      "model_type": "EMBEDDING",
      "pids": [
        139
      ],
      "ready": true
    }
  ],
  "reader1": [
    {
      "reader_id": "reader1",
      "device": "cpu",
      "model_name": "intfloat/e5-small-v2",
      "model_type": "EMBEDDING",
      "pids": [
        139
      ],
      "ready": true
    }
  ],
  "reader2": [
    {
      "reader_id": "reader2",
      "device": "cuda",
      "model_name": "meta-llama/Llama-2-7b-chat-hf",
      "model_type": "CAUSAL",
      "pids": [
        141
      ],
      "ready": true
    }
  ],
  "generate": [
    {
      "reader_id": "reader2",
      "device": "cuda",
      "model_name": "meta-llama/Llama-2-7b-chat-hf",
      "model_type": "CAUSAL",
      "pids": [
        141
      ],
      "ready": true
    }
  ]
}
note

Note that this endpoint returns consumer groups, with the readers in each one listed. Each reader actually spawns an additional consumer group with the same name as its id, which is why you'll see each reader appearing twice here.
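If you just want the group names rather than the full reader details, a small filter over the same endpoint works. This sketch assumes jq is installed on the machine issuing the request.

# List only the consumer group names (each reader's own ID appears as a group too)
curl -sL 'http://localhost:3001/reader_groups' \
-H 'Accept: application/json' | jq 'keys'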

We can then add a reader with a POST request to localhost:3001/reader with a ReaderConfig as the body. We'll launch a reranker model to go with our embedding and generative models, creating a new consumer group classify for it using this body:

Body
{
  "consumer_group": "classify",
  "device": "cpu",
  "model_name": "BAAI/bge-reranker-base"
}
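As a sketch, the POST request with this body can be issued with curl; the Content-Type header is an assumption about what the API expects.

curl -X POST 'http://localhost:3001/reader' \
-H 'Content-Type: application/json' \
-d '{
  "consumer_group": "classify",
  "device": "cpu",
  "model_name": "BAAI/bge-reranker-base"
}'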

We can then issue the request to get this response. Note the reader ID that's returned in the response - we'll need it to delete the reader in a second! Ours is 476f28b2-069f-4b5b-9471-b29ed089b31b, but they're randomly generated so yours will differ.

POST localhost:3001/reader
[
  "476f28b2-069f-4b5b-9471-b29ed089b31b",
  {
    "model_name": "BAAI/bge-reranker-base",
    "device": "cpu",
    "consumer_group": "classify"
  }
]

We can run a GET request to check for the specific reader ID we assigned:
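As with the earlier requests, this can be sent with curl, for example:

curl -L 'http://localhost:3001/reader/476f28b2-069f-4b5b-9471-b29ed089b31b' \
-H 'Accept: application/json'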

GET localhost:3001/reader/476f28b2-069f-4b5b-9471-b29ed089b31b

{
  "model_name": "BAAI/bge-reranker-base",
  "model_type": "SEQUENCE_CLASSIFICATION",
  "pids": [
    448
  ],
  "consumer_groups": [
    "classify",
    "476f28b2-069f-4b5b-9471-b29ed089b31b"
  ],
  "consumer_name": "476f28b2-069f-4b5b-9471-b29ed089b31b",
  "device": "cpu"
}

You can see that the classify consumer group was created when we added the new reader. Let's now try to delete the reader using:

DELETE localhost:3001/reader/476f28b2-069f-4b5b-9471-b29ed089b31b
// This one only returns a 204 status code if it succeeds.
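With curl, adding -i shows the status line, since a successful deletion has no response body:

curl -i -X DELETE 'http://localhost:3001/reader/476f28b2-069f-4b5b-9471-b29ed089b31b'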

This will remove the reader from the consumer group and kill it. We can then check it's been deleted:

GET http://localhost:3001/reader_groups

{
  "reader1": [
    {
      "reader_id": "reader1",
      "device": "cpu",
      "model_name": "intfloat/e5-small-v2",
      "model_type": "EMBEDDING",
      "pids": [
        139
      ],
      "ready": true
    }
  ],
  "embed": [
    {
      "reader_id": "reader1",
      "device": "cpu",
      "model_name": "intfloat/e5-small-v2",
      "model_type": "EMBEDDING",
      "pids": [
        139
      ],
      "ready": true
    }
  ],
  "generate": [
    {
      "reader_id": "reader2",
      "device": "cuda",
      "model_name": "meta-llama/Llama-2-7b-chat-hf",
      "model_type": "CAUSAL",
      "pids": [
        141
      ],
      "ready": true
    }
  ],
  "reader2": [
    {
      "reader_id": "reader2",
      "device": "cuda",
      "model_name": "meta-llama/Llama-2-7b-chat-hf",
      "model_type": "CAUSAL",
      "pids": [
        141
      ],
      "ready": true
    }
  ]
}


As the reader was the last reader in the consumer group, the consumer group has also been deleted.

info

As the Management API can be used to remove or manipulate readers, you should consider restricting access to its port to administrators only.
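One simple way to do that with Docker is to publish the management port on the host's loopback interface only, so it can't be reached from other machines, while leaving the inference port public. This is standard Docker port binding rather than anything Takeoff-specific.

Example: exposing the Management API on localhost only
docker run --gpus all \
-p 3000:3000 \
-p 127.0.0.1:3001:3001 \
-v ~/.takeoff_cache:/code/models \
-v ./my_config.yaml:/code/config.yaml \
-e LICENSE_KEY=[your_license_key] \
tytn/takeoff-pro:0.21.2-gpu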