Version: 0.10.x

Quickstart

Welcome to Titan Takeoff! Follow these steps to get your first model optimised and ready for inference; it should take only a few minutes.

On first run

If this is your first time running Takeoff, there are extra steps you'll need to follow:

Installing Docker

Takeoff runs using Docker, which you'll need to install first. To run models on GPU, you'll also need to have installed the NVIDIA Container Toolkit and have it configured to work with Docker. A guide on how to do this can be found here.

Validating Takeoff

On first run, you'll need access to our Docker registry: run docker login -u takeoffusers and enter the Docker authentication token you were provided. You'll then need to provide a license key the first time you launch the server, by passing -e LICENSE_KEY=<your_license_key> to docker run. See Validating Takeoff for more info.

Starting Takeoff

To get up and running with Takeoff, we'll use an example Mistral generative model, mistralai/Mistral-7B-Instruct-v0.1.

Help! Mistral-7B is too big...

According to the model memory calculator, Mistral-7B requires at least 14GB of memory (for example, 14GB of VRAM on a GPU) to run unquantized with 16-bit weights. You can use any supported model for this tutorial, such as TheBloke/Mistral-7B-Instruct-v0.1-AWQ, which needs only 3.5GB of memory thanks to its use of int4 quantization. See here for more on determining which models you can launch with your available hardware.
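As a rough sanity check on these figures, weight memory can be approximated as parameter count times bytes per parameter. The sketch below assumes a round 7 billion parameters, 16 bits per weight for the unquantized model, and 4 bits per weight for the int4 version. The helper name estimate_weight_gb is made up for illustration, and the estimate ignores activations, the KV cache, and runtime overhead, so real usage will be higher.

```shell
# Rough weight-memory estimate in GB:
#   parameters (in billions) x bits per parameter / 8
# Assumption: only the weights are counted, so treat the result as a lower bound.
estimate_weight_gb() {
  est_params_billions="$1"   # e.g. 7 for a 7B-parameter model
  est_bits_per_param="$2"    # 16 for float16/bfloat16, 4 for int4
  awk -v p="$est_params_billions" -v b="$est_bits_per_param" \
    'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_weight_gb 7 16   # prints 14.0
estimate_weight_gb 7 4    # prints 3.5
```

If the number for your chosen model is close to or above your available memory, a quantized variant is the safer choice.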

We can then launch Takeoff using the command:

# --gpus all makes every GPU on the host available to the container
docker run --gpus all \
-e TAKEOFF_MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.1 \
-e TAKEOFF_DEVICE=cuda \
-e LICENSE_KEY=<your_license_key> \
-e TAKEOFF_MAX_SEQUENCE_LENGTH=1024 \
-p 3000:3000 \
tytn/takeoff-pro:<TAKEOFF_VERSION>-gpu

The variables used are:

  • TAKEOFF_MODEL_NAME: Model to use
  • TAKEOFF_DEVICE: What device to run it on (cuda for GPU, cpu for CPU)
  • LICENSE_KEY: License key to authenticate your copy of Takeoff
  • TAKEOFF_MAX_SEQUENCE_LENGTH: The maximum length of an input and generation in tokens.

Additionally, the -p 3000:3000 flag forwards port 3000 so that you can interact with the container.
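To see how these settings fit together for a different model or host port, the following sketch assembles the same docker run command and prints it for review instead of running it. The function name takeoff_run_cmd is hypothetical, and the printed command expects you to have set the shell variables LICENSE_KEY and TAKEOFF_VERSION yourself; neither value is filled in here.

```shell
# Print a docker run command for a given model and host port, without running it.
# Assumption: LICENSE_KEY and TAKEOFF_VERSION are shell variables you set before
# executing the printed command; they are emitted as references, not expanded.
takeoff_run_cmd() {
  run_model="$1"              # model name, e.g. from the Hugging Face Hub
  run_host_port="${2:-3000}"  # host port forwarded to container port 3000
  cat <<EOF
docker run --gpus all \\
  -e TAKEOFF_MODEL_NAME=$run_model \\
  -e TAKEOFF_DEVICE=cuda \\
  -e LICENSE_KEY="\$LICENSE_KEY" \\
  -e TAKEOFF_MAX_SEQUENCE_LENGTH=1024 \\
  -p $run_host_port:3000 \\
  tytn/takeoff-pro:"\$TAKEOFF_VERSION"-gpu
EOF
}

# Example: the smaller AWQ model, reachable on host port 8080
takeoff_run_cmd TheBloke/Mistral-7B-Instruct-v0.1-AWQ 8080
```

With a host port other than 3000, the frontend and API are reached on that port instead (localhost:8080 in this example).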

Querying your model

You can check that your model is running by navigating to the frontend. With the -p 3000:3000 mapping used above, it is hosted at localhost:3000; if you mapped a different host port, use that port instead.

[Screenshot: the Takeoff frontend, image alt text "cake"]

If you see this (and the refresh icon isn't spinning) then the server is up!

Enter a prompt and press send, then watch as the response is streamed back. As the model is running with random token sampling, the output you see will likely be different (but still reasonable).
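If you would rather check from a terminal, one option is to poll the port until something answers. This is a minimal sketch: wait_for_takeoff is a hypothetical helper, and a successful response only shows that the server is accepting HTTP requests on that port, not that the model has finished loading.

```shell
# Poll a URL until it answers an HTTP request, or give up after a number of attempts.
# Usage: wait_for_takeoff [url] [attempts] [seconds_between_attempts]
wait_for_takeoff() {
  wait_url="${1:-http://localhost:3000}"
  wait_attempts="${2:-60}"
  wait_delay="${3:-5}"
  wait_count=0
  while [ "$wait_count" -lt "$wait_attempts" ]; do
    # curl exits 0 once it receives any HTTP response from the server
    if curl -s -o /dev/null --max-time 5 "$wait_url"; then
      echo "Server is responding at $wait_url"
      return 0
    fi
    wait_count=$((wait_count + 1))
    sleep "$wait_delay"
  done
  echo "No response from $wait_url after $wait_attempts attempts" >&2
  return 1
}
```

Called with no arguments, wait_for_takeoff polls http://localhost:3000 every 5 seconds for up to 60 attempts.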

Querying via cURL

You can also try querying the model via cURL. To try a non-streaming response, we use the generate endpoint.

cURL query
curl http://localhost:3000/generate \
-X POST \
-N \
-H "Content-Type: application/json" \
-d '{"text": "What are the main ingredients in a cake?"}'
Response
{"text":"The main ingredients in a cake are flour, sugar, eggs, butter or oil, baking powder, and vanilla extract. These ingredients form the basic structure of a cake, with variations depending on the specific type of cake being made. Other common ingredients include cocoa powder for chocolate cakes, milk for moist cakes, and different flavorings such as lemon zest or almond extract."}
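For scripting, the same request can be wrapped in a small shell function. The sketch below is a convenience built on the cURL query above, not part of Takeoff: takeoff_payload and takeoff_generate are hypothetical helper names, and the escaping handles only backslashes and double quotes in the prompt.

```shell
# Build the JSON body for the generate endpoint.
# Escapes backslashes and double quotes; newlines and other control
# characters in the prompt are not handled in this sketch.
takeoff_payload() {
  payload_escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"text": "%s"}\n' "$payload_escaped"
}

# POST a prompt to the generate endpoint and print the raw JSON response.
# Usage: takeoff_generate "prompt" [base_url]
takeoff_generate() {
  generate_base_url="${2:-http://localhost:3000}"
  curl -sS "$generate_base_url/generate" \
    -X POST \
    -H "Content-Type: application/json" \
    -d "$(takeoff_payload "$1")"
}

takeoff_payload 'Say "hi"'   # prints {"text": "Say \"hi\""}
```

With the server running, takeoff_generate "What are the main ingredients in a cake?" sends the same request as the cURL query above.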