Model management via config manifest
Takeoff can be configured to run multiple models on a single machine by specifying each reader individually. These are specified as an array (readers_config) of ReaderConfigs within a config.yaml file, which is then mounted to /code/config.yaml inside the Takeoff container. More details on using config manifest files can be found here.
Example
This example launches two consumer groups, one for embedding and one for generation, placing a single embedding model in the embedding group and two copies of a Llama 2 model in the generation group. In total, three models are hosted concurrently on a single machine: a copy of Llama-2-7b-chat on each of the two available GPUs, and a smaller embedding model on the CPU, all administered from a single Takeoff container.
takeoff:
  server_config: # Shared across readers
  readers_config:
    reader1:
      model_name: "intfloat/e5-small-v2"
      device: "cpu"
      consumer_group: "embed"
      max_sequence_length: 1024
      batch_duration_millis: 200
      max_batch_size: 64
    reader2:
      model_name: "meta-llama/Llama-2-7b-chat-hf"
      device: "cuda"
      quant_type: "awq"
      consumer_group: "generate"
      max_batch_size: 32
      max_sequence_length: 1024
      cuda_visible_devices: "0" # Put on the first GPU, i.e. the device with device_id 0
    reader3:
      model_name: "meta-llama/Llama-2-7b-chat-hf"
      device: "cuda"
      quant_type: "awq"
      consumer_group: "generate"
      max_batch_size: 32
      max_sequence_length: 1024
      cuda_visible_devices: "1" # Put on the second GPU
This file can then be mounted into the container, and Takeoff launched with docker run. Note that in this example we also forward port 3001, allowing us to manage the launched readers via the Management API.
# Port 3000 serves inference; port 3001 serves the Management API.
# Mount the model cache and the config manifest, and choose the gpu or cpu image.
docker run --gpus all \
  -p 3000:3000 \
  -p 3001:3001 \
  -v ~/.takeoff_cache:/code/models \
  -v $(pwd)/config.yaml:/code/config.yaml \
  tytn/takeoff-pro:0.14.3-gpu
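Because port 3001 is forwarded, the launched readers can then be inspected and managed at runtime. The calls below are a sketch only: the endpoint paths are assumptions about the Management API and may differ between versions, so consult the Management API documentation before relying on them.

# List the currently running readers, grouped by consumer group (assumed endpoint).
curl http://localhost:3001/reader_groups

# Remove a reader by its id, as reported by the listing above (assumed endpoint).
curl -X DELETE http://localhost:3001/reader/<reader_id>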