
Running a Takeoff Docker Container


This guide details how to launch the Takeoff Engine as a single docker container on a local machine. We'll cover:

  • Getting set up with Docker & Accessing the Takeoff image
  • Deploying a single instance of a model
  • Running inference with your deployed model
  • Using the inference GUI

We'll then move on to more bespoke topics:

  • Deploying a model across multiple GPUs
  • Deploying a locally stored (non-huggingface) model

Getting set up with Docker & Accessing the Takeoff image​

Launching Takeoff as a docker container requires both Docker and the NVIDIA Container Toolkit to be installed, with Docker then configured to use the toolkit.

On first run of the Takeoff container, you'll need to ensure you have access to our Docker registry - run docker login -u takeoffusers and enter the Docker authentication token you were provided. Make sure you've also got your license key to hand - you'll need it to launch the container in a moment.
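If you want to sanity check that setup before pulling the Takeoff image, the commands below are a minimal sketch (the CUDA base image tag is just an example - any recent one will do):

# Confirm Docker can see your GPUs via the NVIDIA Container Toolkit - this should print your GPU table
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# Authenticate against the Takeoff registry; paste your authentication token when prompted
docker login -u takeoffusers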

Deploying a single instance of a model​

A quick guide to readers vs containers vs models​

Docker deploys an instance of an image as a container. Each instance of a model within a Takeoff container is called a reader, and one Takeoff container can have multiple readers. A model here means the model definition itself, whereas a reader is a served instance of a model: it applies the model to inputs it reads off a queue of pending inference requests. In the local (i.e. all on one machine) case the different readers in a container are usually different models - e.g. one for generation and one for embedding.

We're going to focus on the local case here, launching a single container and initially just one reader, but there's more information on readers and consumer groups here.

Taking off with Docker run​

Takeoff has a quick launch option whereby you can configure your inference server entirely by passing arguments to the docker run command. This runs the docker container, and instantiates a single reader with the settings you've specified.

We're going to launch a small generative model, using this command:

Example launch
docker run \
-e TAKEOFF_MODEL_NAME=meta-llama/Llama-3.2-1B \
-e TAKEOFF_DEVICE=cuda \
-e LICENSE_KEY=[your_license_key] \
-e TAKEOFF_ACCESS_TOKEN=123 \
-p 3000:3000 \
-v ~/.takeoff_cache:/code/models \
-it \
--gpus all \
tytn/takeoff-pro:0.21.2-gpu

Let's run through what each parameter is doing here. We'll start with the Takeoff-specific ones, which are specified as environment variables within the container using docker's -e syntax.

  • TAKEOFF_MODEL_NAME=meta-llama/Llama-3.2-1B - the model we want to launch an instance of. In most cases this will be a model name from huggingface, but it could also be a filepath.
  • TAKEOFF_DEVICE=cuda - whether to run the model on gpu (cuda) or cpu. We're using a generative model, so will pick cuda.
  • LICENSE_KEY - the license key you were provided with.
  • TAKEOFF_ACCESS_TOKEN - the model we chose here is gated on huggingface, so you need to accept its user agreement before you can download it. This should be a token generated from your huggingface account after you've been granted access to the model; the same token also grants access to any private models your account can see. If you don't have access to the model we chose, you can use a publicly uploaded alternative like unsloth/Llama-3.2-1B-Instruct and leave out the access token (see the alternative launch example below).

There are also some parameters we need to configure the container itself:

  • -p 3000:3000 - this maps port 3000 inside the container (right-hand side) to port 3000 on the host (left-hand side), so you can make inference calls to localhost:3000. You can change the left-hand side if you need a different host port, just bear in mind that the remaining examples here assume Takeoff is accessible on port 3000.
  • -v ~/.takeoff_cache:/code/models - this mounts a folder on the local machine (~/.takeoff_cache, which will be created if it doesn't exist yet) into the container's filesystem, allowing the container to cache models. This parameter is optional, but using it does mean you won't have to re-download the model when you next launch the container.
  • -it - this runs our container in interactive mode in the terminal. This essentially means you can kill the container with CTRL+C should you wish to.
  • --gpus all - this specifies what gpus are available to the container, and will be covered in more detail when we get to deploying a multi-gpu model.

There are more details of configuration options for you to dive into here, but the ones above will be enough for our purposes.
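If you'd rather avoid the gated model entirely, here's a sketch of the same launch using the publicly available unsloth/Llama-3.2-1B-Instruct mentioned above, with the access token left out:

Example launch (public model)
docker run \
-e TAKEOFF_MODEL_NAME=unsloth/Llama-3.2-1B-Instruct \
-e TAKEOFF_DEVICE=cuda \
-e LICENSE_KEY=[your_license_key] \
-p 3000:3000 \
-v ~/.takeoff_cache:/code/models \
-it \
--gpus all \
tytn/takeoff-pro:0.21.2-gpu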

GPU vs CPU images

You'll notice we launched an instance of the gpu version of the Takeoff image. There's also a cpu version, which is a much smaller image but only supports classification and embedding models, and only runs them on the cpu.
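As an illustration only - the -cpu image tag and the choice of embedding model here are assumptions, so check the registry and your model catalogue for what's actually available - a CPU launch might look something like this:

Example CPU launch (illustrative)
docker run \
-e TAKEOFF_MODEL_NAME=BAAI/bge-small-en-v1.5 \
-e TAKEOFF_DEVICE=cpu \
-e LICENSE_KEY=[your_license_key] \
-p 3000:3000 \
-v ~/.takeoff_cache:/code/models \
-it \
tytn/takeoff-pro:0.21.2-cpu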

If you see this in your terminal, then the Takeoff container has launched:

[Takeoff welcome banner]

You'll then want to wait until you see Reader ready to process requests before proceeding.

Running inference with your deployed model​

Let's test that our model can run inferences. In a new terminal (or API platform like Postman), run:

curl http://localhost:3000/generate \
-X POST \
-N \
-H "Content-Type: application/json" \
-d '{"text": "What are the main ingredients in a cake?"}'

You should get a response back in the form of JSON:

{"text":" Flour, sugar, eggs, butter, milk, and vanilla extract are the basic ingredients in a cake. However, the type and amount of these ingredients can vary depending on the type of cake.\n\nHere are some common variations:\n\n*   **Flour**: All-purpose flour is a common base for cakes, but you can also use bread flour or cake flour for different textures.\n*   **Sugar**: Granulated sugar is a classic choice, but you can also use brown sugar, honey, or other sweeteners.\n*   **Eggs**: Whole eggs or egg whites are used to add moisture and richness to cakes.\n*   **"}

Using the inference GUI​

Takeoff also comes with a GUI to let you quickly experiment with different models and prompts. There are two interfaces available - Chat (similar to OpenAI's ChatGPT) and Playground (similar to OpenAI's completions playground). Access it by navigating to http://localhost:3000 in your browser.

[Chat interface screenshot]

When using the chat interface, the model gets a copy of your conversation history, whereas in playground only what's in the input box is sent.

[Playground interface screenshot]

Handy playground shortcuts

Shift + Enter: New Line

Enter: Submit / Ask the model for a completion.

You're now ready to use your Takeoff container for inference! Follow the guides here for more information on this, or proceed to the next section for information on how to deploy a model across multiple GPU devices, and how to deploy a model you have stored locally.


Deploying a model across multiple GPUs​

Generative models can be deployed across multiple GPU devices using Takeoff. In practice, this allows the use of bigger batch sizes, which improves throughput. To configure it, you'll need to tell the reader which GPUs are visible to it by setting TAKEOFF_CUDA_VISIBLE_DEVICES. By default, a reader will only use device 0, but if TAKEOFF_CUDA_VISIBLE_DEVICES is specified it'll use every device visible to it.

For example, if nvidia-smi indicates you have two devices available, you can split your model across those two devices by setting TAKEOFF_CUDA_VISIBLE_DEVICES=0,1. Some key things to note here are:

  • TAKEOFF_CUDA_VISIBLE_DEVICES takes a comma separated list of device integers.
  • GPU devices are 0-indexed
  • The number of devices provided must be a power of 2 (1, 2, 4, 8 etc. devices)
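
To check which device indices are available on your machine, you can query nvidia-smi directly:

nvidia-smi --query-gpu=index,name,memory.total --format=csv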

Docker restricts communication between parallel processes to a shared memory buffer with a very small default size. To avoid unexpected issues here, we'll set --shm-size=2gb to increase the size of this buffer.
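If you're curious what the default looks like, you can inspect /dev/shm inside a throwaway container (the ubuntu image here is just an example):

# /dev/shm defaults to 64M in a standard Docker container
docker run --rm ubuntu df -h /dev/shm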

Whilst we could use multiple GPUs to achieve higher throughput for our small model, multi-GPU deployment is more commonly used to run models which wouldn't otherwise fit on a single GPU, as this full example shows:

Example multi-gpu launch
docker run \
-e TAKEOFF_MODEL_NAME=meta-llama/Llama-3.2-3B-Instruct \
-e TAKEOFF_ACCESS_TOKEN=123 \
-e TAKEOFF_DEVICE=cuda \
-e LICENSE_KEY=[your_license_key] \
-e TAKEOFF_CUDA_VISIBLE_DEVICES="0,1" \
--shm-size=2gb \
-p 3000:3000 \
-v ~/.takeoff_cache:/code/models \
-it \
--gpus all \
tytn/takeoff-pro:0.21.2-gpu
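
Once the reader reports it's ready, you can confirm the model has actually been sharded across both devices by watching GPU memory usage from the host:

# memory usage should climb on both device 0 and device 1 as the model loads
watch -n 1 nvidia-smi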

Deploying a locally stored (non-huggingface) model​

You might want to use a model you've trained locally, instead of one available on huggingface. Whilst one option is to upload the model to huggingface as a private model and then use an access token, Takeoff also lets you mount the model into the container.

You can achieve this using -v and picking a location in the container. Let's say you have your model saved to the ~/my_model folder; you could then use -v ~/my_model:/local_models/my_model (note that the path in the container has to be absolute for a volume mount).

You can then reference the model inside the container by using the filepath: TAKEOFF_MODEL_NAME=/local_models/my_model. So an entire example here would be:

docker run \
--gpus all \
-p 3000:3000 \
-e TAKEOFF_DEVICE=cuda \
-e LICENSE_KEY=[your_license_key] \
-v ~/my_model:/local_models/my_model \
-e TAKEOFF_MODEL_NAME=/local_models/my_model \
tytn/takeoff-pro:0.21.2-gpu

Note

You'll need to ensure both the model and the tokenizer are saved to the directory you're mounting, even if the tokenizer is one available with another model.
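
As a rough guide - the exact filenames depend on how the model was saved - a huggingface-format model directory typically contains something like:

ls ~/my_model
# config.json  model.safetensors  tokenizer.json  tokenizer_config.json  special_tokens_map.json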

What next?​

This guide covered how to deploy a single model quickly using a docker container, but you may want to have more than one model deployed at once. The simplest way to do this declaratively in Takeoff is with manifests, covered in the guide found here. You can also use a manifest to declare the configuration for a single model, rather than passing it to docker run as environment variables, something covered here.