Launching Takeoff
Getting started with Docker
If this is your first time running Takeoff, there are two extra steps you'll need to follow:
Installing Docker
Takeoff runs using Docker, which you'll need to install first. To run models on GPU, you'll also need to have installed the NVIDIA Container Toolkit and have it configured to work with Docker. A guide to do this can be found here.
Validating Takeoff
On first run, you'll need to ensure you have access to our Docker registry: run `docker login -u takeoffusers` and enter the Docker authentication token you were provided. You'll also need to provide a license key the first time you launch the server (use `docker run` with `-e LICENSE_KEY=[your_license_key]`). See accessing takeoff for more info.
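The registry step above is a one-off and can be sketched as:

```bash
# One-off: authenticate against the Takeoff Docker registry. When prompted
# for a password, enter the authentication token you were provided.
docker login -u takeoffusers
```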
One Takeoff container allows inference to be run on multiple models, with each model being optimised and deployed as a reader. To launch a single reader/model using `docker run`, supply the required information as environment variables and Docker options:
```bash
# -p forwards a port from the container; -v volume-mounts the model cache;
# the image tag selects the gpu or cpu build.
docker run --gpus all \
  -e TAKEOFF_MODEL_NAME=TheBloke/Llama-2-7B-Chat-AWQ \
  -e TAKEOFF_DEVICE=cuda \
  -e LICENSE_KEY=[your_license_key] \
  -e TAKEOFF_MAX_SEQUENCE_LENGTH=1024 \
  -p 3000:3000 \
  -v ~/.takeoff_cache:/code/models \
  tytn/takeoff-pro:0.11.0-gpu
```
This example runs the GPU version of Takeoff (`tytn/takeoff-pro:0.11.0-gpu`) and mounts `~/.takeoff_cache` into the container so that the local filesystem can cache models for use between Takeoff instances. It launches a single reader which orchestrates the model specified by `TAKEOFF_MODEL_NAME`.
CPU and GPU images

| Version to Download | Image |
|---|---|
| CPU | `tytn/takeoff-pro:0.11.0-cpu` |
| GPU | `tytn/takeoff-pro:0.11.0-gpu` |
Takeoff comes as one of two images. The CPU image is much smaller, but will only allow models to be run on the CPU. The GPU image allows models to be run on either CPU or GPU, but is much larger.
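For example, to fetch the appropriate image ahead of launch:

```bash
# Pull only the image you need; the CPU image is much smaller.
docker pull tytn/takeoff-pro:0.11.0-cpu
# or, for CPU-or-GPU inference:
docker pull tytn/takeoff-pro:0.11.0-gpu
```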
Docker run variables
Environment variables
The name, type and behaviour of the launched model can be specified by Takeoff-specific Docker environment variables (`-e`). All mandatory variables, as well as some key optional ones, are listed below.
| Environment Variable Name | Default Value | Explanation |
|---|---|---|
| `TAKEOFF_MODEL_NAME` | None (required) | The name of the model to initially use: either a Hugging Face model or the name of a folder mounted to `/code/models`. |
| `TAKEOFF_DEVICE` | None (required) | The device that the server should use: either `"cuda"` or `"cpu"`. |
| `TAKEOFF_CONSUMER_GROUP` | `primary` | The name of the consumer group that the initial model should belong to. |
| `TAKEOFF_MAX_BATCH_SIZE` | 8 | The batch size the model can use. |
| `TAKEOFF_BATCH_DURATION_MILLIS` | 100 | The timeout interval (in ms) for dynamic batching. |
| `TAKEOFF_ACCESS_TOKEN` | None | Access token for private Hugging Face repositories. |
| `TAKEOFF_CUDA_VISIBLE_DEVICES` | None | (GPU only) Which GPUs are visible to the reader. If unspecified, all available GPUs are used. The list's length must be a power of 2, e.g. `"0,1"`. |
| `TAKEOFF_TENSOR_PARALLEL` | 1 | (GPU only) How many GPUs to split the model across. Greedily selects the lowest-numbered n GPUs. |
| `TAKEOFF_QUANT_TYPE` | None | (GPU only) The type of quantization used with the model. If no value is provided, AWQ is used when "awq" appears in the model name; if set to `"awq"`, AWQ is used irrespective of the model name. |
| `TAKEOFF_NVLINK_UNAVAILABLE` | 0 | (GPU only) Should be set to 1 if you are on a system without NVLink (e.g. L4s, 4090s) to allow use of multiple GPUs. |
| `TAKEOFF_MAX_SEQUENCE_LENGTH` | None (strongly recommended) | (GPU only) The maximum foreseen length of prompt + generated tokens. If not set, the model's maximum sequence length from its config file is used. See more below. |
| `LICENSE_KEY` | None (required on first run) | Takeoff license key for key validation. |
| `OFFLINE_MODE` | false | Run Takeoff in offline mode. |
Note that only `LICENSE_KEY`, `OFFLINE_MODE`, `TAKEOFF_MAX_BATCH_SIZE` and `TAKEOFF_BATCH_DURATION_MILLIS` (and thus none of the variables marked as required) are supported when using a manifest file to configure Takeoff.
Picking (or omitting) a MAX_SEQUENCE_LENGTH
We strongly recommend you set a value for `TAKEOFF_MAX_SEQUENCE_LENGTH`: the Takeoff Inference Engine pre-allocates a block of memory based on this value, and the default (the maximum sequence length of the selected model) will usually cause an out-of-memory error.
When running on CPU, this behaviour isn't present, so this variable is not used. To control generation length on CPU, use the `prompt_max_length` and `max_new_tokens` parameters at inference time instead.
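A per-request sketch of those parameters, assuming the inference endpoint is forwarded to port 3000 and accepts these fields in its JSON body (the `/generate` path is an assumption, not taken from this page):

```bash
# prompt_max_length caps the prompt tokens; max_new_tokens caps generation.
curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "What is the capital of France?",
       "prompt_max_length": 512,
       "max_new_tokens": 128}'
```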
Docker options
Takeoff requires the use of some standard Docker options, which are detailed in this section.
Standard Docker Options
Key Docker options are listed below. These should be provided as flags, as shown in the examples.
| Option | Purpose | Example | Use in Takeoff |
|---|---|---|---|
| `-v` | Volume mounts a directory, making a local filesystem folder available to the container, with syntax `host_directory:container_directory`. | `-v ~/.takeoff_cache:/code/models` attaches the local `.takeoff_cache` folder to `/code/models` inside the container. | Allows model files hosted on the local machine to be available within the container. Model files can then be shared between instances, rather than each instance having to download a new copy. |
| `-p` | Forwards a container port to a host port, with syntax `host_port:container_port`. | `-p 3005:3000` forwards the internal port 3000 (Takeoff's inference endpoint) to 3005 on the host system. | Takeoff's ports must be forwarded to make its endpoints accessible outside of the container. The container-side (RHS) port should be one of 3000 (inference endpoint & playground), 3001 (management API) or 9090 (metrics endpoint). Use multiple `-p` options to expose each endpoint locally. |
| `-it` | Starts the container in interactive mode. | `-it` | Allows server logs to be monitored and interacted with, e.g. allowing Takeoff to be terminated by CTRL+C. |
| `--gpus` | Specifies which GPUs are available to the container. | `--gpus all` | Allows Takeoff to access GPUs. |
| `--shm-size` | Sets the amount of memory available for IPC within the container. | `--shm-size 2gb` | Allows the various processes in Takeoff to communicate. Strongly recommended to be set to 2gb. |
See a full reference here.
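Putting these options together, a fuller launch command might look like this sketch (the model name and license key are placeholders carried over from the earlier example):

```bash
# Forwards the inference (3000), management (3001) and metrics (9090)
# endpoints, runs interactively, and sets the recommended IPC memory.
docker run --gpus all -it \
  --shm-size 2gb \
  -e TAKEOFF_MODEL_NAME=TheBloke/Llama-2-7B-Chat-AWQ \
  -e TAKEOFF_DEVICE=cuda \
  -e LICENSE_KEY=[your_license_key] \
  -p 3000:3000 \
  -p 3001:3001 \
  -p 9090:9090 \
  -v ~/.takeoff_cache:/code/models \
  tytn/takeoff-pro:0.11.0-gpu
```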
Configuring Takeoff using manifest files
Takeoff's parameters can also be specified via a `config.yaml` manifest file, mounted to `/code/config.yaml` in the container.
config.yaml
The config manifest consists of two sections:

- `server_config`: Parameters which control the server as a whole. If not specified, the defaults from above are used.
- `readers_config`: An array of reader configurations, each specifying the behaviour of a specific reader. You can launch Takeoff without any readers by leaving this array empty.
Keys are specified as in the table above, but without the `TAKEOFF_` prefix. The following should be noted:
- `LICENSE_KEY` and `OFFLINE_MODE` can only be specified as environment variables, using `-e` with `docker run`.
- Standard Docker options can only be specified as arguments to `docker run`.
- `server_config` variables in `config.yaml` (i.e. `max_batch_size` and `batch_duration_millis`) can be overridden by passing environment variables to `docker run`.
- `reader_config` variables must not be passed to `docker run` whilst specifying readers using a manifest file. A reader's configuration is specific to that reader, and so a separate configuration must be given for each via the config file.
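As a concrete sketch of the override rule for `server_config` values, an environment variable passed to `docker run` takes precedence over the corresponding value in the mounted manifest:

```bash
# TAKEOFF_BATCH_DURATION_MILLIS here overrides any batch_duration_millis
# value set under server_config in the mounted config.yaml.
docker run --gpus all \
  -e TAKEOFF_BATCH_DURATION_MILLIS=50 \
  -v ./config.yaml:/code/config.yaml \
  -v ~/.takeoff_cache:/code/models \
  -p 3000:3000 \
  tytn/takeoff-pro:0.11.0-gpu
```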
`config.yaml` is of the format:

```yaml
takeoff:
  server_config:
    <AppConfig>
  readers_config:
    <ReaderName>:
      <ReaderConfig>
    <ReaderName>:
      <ReaderConfig>
```

Where `<AppConfig>` takes the fields:

```yaml
batch_duration_millis: int
max_batch_size: int          # Applies to all embedding models
```

and `<ReaderConfig>` takes the fields:

```yaml
model_name: str              # required
device: str                  # required
consumer_group: str          # required
max_batch_size: int          # Per-model; generative models only
cuda_visible_devices: str    # e.g. "0,1,2,3" or "0"
tensor_parallel: int
quant_type: str
max_sequence_length: int
nvlink_unavailable: int
```
Example
This example details serving a single generative model by launching with a single reader. Launching multiple readers at start time is also possible, and is detailed here as part of the model management functionality.
```yaml
takeoff:
  server_config:
    batch_duration_millis: 200
  readers_config:
    reader:
      model_name: "meta-llama/Llama-2-7b-chat-hf"
      device: "cuda"
      quant_type: "awq"
      consumer_group: "primary"
      max_batch_size: 32
      max_sequence_length: 1024
      cuda_visible_devices: "0"
```
This file can then be mounted into the container at `/code/config.yaml`, and Takeoff launched with `docker run`:

```bash
# Volume-mount both the models folder and the config file, and forward the
# inference port; the image tag selects the gpu or cpu build.
docker run --gpus all \
  -p 3000:3000 \
  -v ~/.takeoff_cache:/code/models \
  -v ./config.yaml:/code/config.yaml \
  tytn/takeoff-pro:0.11.0-gpu
```
Note that `TAKEOFF_MODEL_NAME` and `TAKEOFF_DEVICE` were mandatory variables when previously using `docker run`. As we're using a config manifest, these variables must not be passed.