Quickstart
Welcome to Titan Takeoff! Getting your first model optimised and ready for inference should take only a few minutes by following these steps.
If this is your first time running Takeoff, there are extra steps you'll need to follow:
Installing Docker
Takeoff runs using Docker, which you'll need to install first. To run models on GPU, you'll also need to have installed the NVIDIA Container Toolkit and have it configured to work with Docker. A guide on how to do this can be found here.
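As a quick sanity check (not a Takeoff command, just standard Docker), you can run nvidia-smi in a throwaway container; if the NVIDIA Container Toolkit is configured correctly, this should print your GPU details:

# With the NVIDIA Container Toolkit configured, --gpus makes the host's
# nvidia-smi available inside the container
docker run --rm --gpus all ubuntu nvidia-smi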
Validating Takeoff
On first run, you'll need to ensure you have access to our Docker registry (run docker login -u takeoffusers and enter the Docker authentication token you were provided). You'll then need to provide a license key the first time you launch the server (use docker run with -e LICENSE_KEY=[your_license_key]). See Validating Takeoff for more info.
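For example, a first-time setup might look like the following (your authentication token and license key are the values you were provided; exporting the key is just a convenience for the docker run command later):

# Log in to the Takeoff Docker registry; paste your authentication token when prompted
docker login -u takeoffusers

# Optional: keep the license key in an environment variable for later commands
export LICENSE_KEY=<your_license_key>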
Starting Takeoff
To get up and running with Takeoff, we'll use the example Mistral generative model mistralai/Mistral-7B-Instruct-v0.1.
Help! Mistral-7b is too big...
According to the model memory calculator, Mistral-7B requires at least 14GB of memory (e.g. 14GB of VRAM) to run in its native half precision (fp16).
You can use any supported model for this tutorial, such as TheBloke/Mistral-7B-Instruct-v0.1-AWQ, which needs only 3.5GB of memory thanks to its use of int4 quantization. See here for more on determining which models you can launch with your available hardware.
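As a rough rule of thumb (an approximation, not the calculator's exact formula), weight memory is about the parameter count times the bytes per parameter:

# Rule of thumb: weight memory ≈ parameters × bytes per parameter
# Mistral-7B: ~7e9 params × 2 bytes (fp16) ≈ 14 GB
# AWQ int4:   ~7e9 params × 0.5 bytes      ≈ 3.5 GB
python3 -c "p = 7.0e9; print(f'fp16 ~{p*2/1e9:.0f} GB, int4 ~{p*0.5/1e9:.1f} GB')"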
We can then launch Takeoff using the following command:
# --gpus tells Docker to make the GPUs available to the container
docker run --gpus all \
-e TAKEOFF_MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.1 \
-e TAKEOFF_DEVICE=cuda \
-e LICENSE_KEY=<your_license_key> \
-e TAKEOFF_MAX_SEQUENCE_LENGTH=1024 \
-p 3000:3000 \
tytn/takeoff-pro:<TAKEOFF_VERSION>-gpu
The variables used are:
- TAKEOFF_MODEL_NAME: the model to use
- TAKEOFF_DEVICE: the device to run it on (cuda for GPU, cpu for CPU)
- LICENSE_KEY: the license key to authenticate your copy of Takeoff
- TAKEOFF_MAX_SEQUENCE_LENGTH: the maximum length of an input and generation, in tokens
Additionally, port 3000 needs to be forwarded (the -p 3000:3000 flag) so you can interact with the container.
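Once the container starts, you can confirm that it is up and watch the model load using standard Docker commands (the container ID is whatever docker ps reports for the tytn/takeoff-pro image):

# Confirm the Takeoff container is running
docker ps
# Follow the server logs while the model downloads and loads
docker logs -f <container_id>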
Querying your model
You can check your model is running by navigating to the frontend. This will be hosted at localhost:3000 if you used the -p 3000:3000 mapping from the docker run command above.
Enter a prompt and press send, then watch as the response is streamed back. As the model is running with random token sampling, the output you see will likely be different (but still reasonable).
Querying via cURL
You can also try querying the model via cURL. To try a non-streaming response, we use the generate endpoint.
curl http://localhost:3000/generate \
-X POST \
-N \
-H "Content-Type: application/json" \
-d '{"text": "What are the main ingredients in a cake?"}'
{"text":"The main ingredients in a cake are flour, sugar, eggs, butter or oil, baking powder, and vanilla extract. These ingredients form the basic structure of a cake, with variations depending on the specific type of cake being made. Other common ingredients include cocoa powder for chocolate cakes, milk for moist cakes, and different flavorings such as lemon zest or almond extract."}