Version: 0.13.x

Supported models and hardware

Hugging Face models

Takeoff supports most generation and embedding models that Hugging Face Transformers supports natively, which covers most models available on the Hugging Face Hub.

Models from the Llama-2, Mistral and Mixtral families benefit from further optimisations on top of the base Takeoff optimisations.

Multi-GPU support is also available for models from these families. Enable it by specifying the devices to use with the TAKEOFF_CUDA_VISIBLE_DEVICES variable.
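As a rough illustration, the Python sketch below assembles a docker run command that sets TAKEOFF_CUDA_VISIBLE_DEVICES alongside the other variables used on this page. The helper name multi_gpu_command is hypothetical, and the comma-separated index format is an assumption borrowed from the standard CUDA_VISIBLE_DEVICES convention; check the Takeoff reference for the exact format it expects.

```python
import shlex


def multi_gpu_command(model_name, gpu_ids, image="tytn/takeoff-pro:0.13.1-gpu"):
    """Build a `docker run` argument list for a multi-GPU Takeoff launch.

    Hypothetical helper for illustration; not part of Takeoff.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU index is required")
    # ASSUMPTION: devices are given as comma-separated indices,
    # mirroring the usual CUDA_VISIBLE_DEVICES convention.
    devices = ",".join(str(i) for i in gpu_ids)
    return [
        "docker", "run", "--gpus", "all",
        "-e", f"TAKEOFF_MODEL_NAME={model_name}",
        "-e", "TAKEOFF_DEVICE=cuda",
        "-e", f"TAKEOFF_CUDA_VISIBLE_DEVICES={devices}",
        image,
    ]


# Print a copy-pasteable command that uses the first two GPUs.
print(shlex.join(multi_gpu_command("<My-HF-Account/My-Model>", [0, 1])))
```

Returning a list rather than a single string means the command can be handed directly to subprocess.run without shell quoting concerns.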

Models quantized with AWQ are supported, and AWQ is the recommended method for running large models on smaller hardware. Read more about AWQ and quantization here. Suitable AWQ models are available here.

Using your own models

How can I use a model I have saved locally?

If you have already fine-tuned a model, you might want to run it in the Takeoff server instead of a Hugging Face model. There are two ways to do this.

Volume mounting
Upload to Hugging Face

Save the model locally and volume mount it.

Example:

Let's say we have trained a model and saved it locally, for example using this Python code:

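from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and its tokenizer from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained('...')
model = AutoModelForCausalLM.from_pretrained('...')

# Your training code ...

# Save both into the directory that will be mounted into Takeoff
tokenizer.save_pretrained('my_model')
model.save_pretrained('my_model')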

Then, when running the Takeoff server from the command line, mount the model directory onto Takeoff's internal /code/models/jf folder.

docker run --gpus all \
    -v /path/to/<my_model>:/code/models/jf/<my_model> \
    -e TAKEOFF_MODEL_NAME=my_model \
    -e TAKEOFF_DEVICE=cuda \
    tytn/takeoff-pro:0.13.1-gpu
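Before mounting, it can help to confirm that the directory really contains a saved model. The sketch below is one possible way to do that: it checks for the config.json file that save_pretrained writes, then builds the same docker run command as above from the directory name. The helper name volume_mount_command is hypothetical, and the check is a convenience, not something Takeoff requires.

```python
from pathlib import Path


def volume_mount_command(model_dir, image="tytn/takeoff-pro:0.13.1-gpu"):
    """Build a `docker run` argument list that mounts a locally saved model.

    Hypothetical helper for illustration; not part of Takeoff.
    """
    model_dir = Path(model_dir).resolve()
    # model.save_pretrained() writes config.json into the target directory,
    # so its absence suggests the path is wrong or the save did not happen.
    if not (model_dir / "config.json").is_file():
        raise FileNotFoundError(f"no config.json found in {model_dir}")
    name = model_dir.name
    return [
        "docker", "run", "--gpus", "all",
        # Mount under Takeoff's internal models folder, keeping the same name.
        "-v", f"{model_dir}:/code/models/jf/{name}",
        "-e", f"TAKEOFF_MODEL_NAME={name}",
        "-e", "TAKEOFF_DEVICE=cuda",
        image,
    ]
```

Calling volume_mount_command("my_model") after the training script above would yield a command whose TAKEOFF_MODEL_NAME matches the mounted folder name, which is the pairing the example relies on.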

Upload the model to a private Hugging Face Hub repository, and pass in your access token so that Takeoff can download the model.

Example:

docker run --gpus all \
    -e TAKEOFF_MODEL_NAME=<My-HF-Account/My-Model> \
    -e TAKEOFF_ACCESS_TOKEN=<My-HF-Token> \
    -e TAKEOFF_DEVICE=cuda \
    tytn/takeoff-pro:0.13.1-gpu
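If you launch the container from a script, one option is to keep the token out of the command line itself. The sketch below relies on standard Docker behaviour: an -e flag given a variable name without a value copies that variable from the calling environment. The helper name private_hub_launch is hypothetical.

```python
import os


def private_hub_launch(model_name, token, image="tytn/takeoff-pro:0.13.1-gpu"):
    """Return (argv, env) for launching Takeoff with a private Hub model.

    Hypothetical helper for illustration; not part of Takeoff.
    """
    if not token:
        raise ValueError("a Hugging Face access token is required")
    argv = [
        "docker", "run", "--gpus", "all",
        "-e", f"TAKEOFF_MODEL_NAME={model_name}",
        # No "=value" here: docker reads the variable from the environment
        # of the process that invokes it, so the token never appears in argv.
        "-e", "TAKEOFF_ACCESS_TOKEN",
        "-e", "TAKEOFF_DEVICE=cuda",
        image,
    ]
    # Copy the current environment and add the token for the docker process.
    env = {**os.environ, "TAKEOFF_ACCESS_TOKEN": token}
    return argv, env
```

The returned pair can be passed to subprocess.run(argv, env=env, check=True). The token then does not show up in process listings or shell history, although it is still visible inside the container's environment.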

Choosing the right model

Selecting the right model means optimising performance within your hardware constraints. Models are often released in several sizes and can be quantized to different levels, each of which affects performance and memory usage. We discuss balancing these factors in more detail here.

To help you avoid Out of Memory errors, we have also created a memory calculator that will estimate the amount of memory a model will use. This can be accessed from the Takeoff inference GUI. You can also specify your hardware's specifications to determine if a specific model can be run on your configuration. See more about using the calculator here.
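For a quick sanity check before opening the calculator, the memory needed for the weights alone can be approximated as parameter count times bytes per parameter. The sketch below does only that; it is not the Takeoff calculator, and it ignores the KV cache, activations and runtime overhead, so treat its result as a lower bound. The function names and the 20% headroom default are illustrative assumptions.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def weights_fit(num_params, bits_per_param, device_memory_gb, headroom=0.2):
    """Rough check: do the weights fit while leaving a fraction of memory free?

    ASSUMPTION: `headroom` is an arbitrary safety margin for the KV cache,
    activations and runtime overhead; it is not a Takeoff parameter.
    """
    usable = device_memory_gb * (1 - headroom)
    return weight_memory_gb(num_params, bits_per_param) <= usable


# A 7-billion-parameter model at three common weight precisions.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B @ {label}: {weight_memory_gb(7e9, bits):.1f} GB")
```

This illustrates why quantization matters on smaller hardware: going from 16-bit to 4-bit weights cuts the weight footprint of a 7B model from roughly 14 GB to roughly 3.5 GB.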

Supported Hardware

Takeoff is designed to work across as wide a range of hardware as possible, to lower the barrier to getting started with LLMs. However, to maximise performance on commonly used hardware, Takeoff sometimes applies hardware-specific optimizations, which means that not all hardware types support all model optimizations. The biggest difference is between Ampere (and later) generation GPUs on one side and pre-Ampere GPUs and CPUs on the other.

Optimizations specific to Ampere and later GPUs are used for the most commonly used model types and for all AWQ models. This means that on CPUs and pre-Ampere GPUs you cannot use models like Llama, Mistral or Mixtral out of the box, nor any AWQ model. These Ampere-specific optimizations can be turned off with the flag TAKEOFF_DISABLE_STATIC=1. Doing so makes the base form of Llama, Mistral, Mixtral, etc. usable on CPUs and pre-Ampere GPUs, although it does not make AWQ models usable.

Hardware Type	Can Use Base Model?	Can Use AWQ Model?
Post-Ampere GPUs	✔️	✔️
Pre-Ampere GPUs	✔️ if TAKEOFF_DISABLE_STATIC=1	❌
CPUs	✔️ if TAKEOFF_DISABLE_STATIC=1	❌

Pre-Ampere GPUs are those from the Turing and Volta generations or earlier. This includes the T4, V100 and Quadro RTX 8000 GPUs. It also includes any GPU from the 10xx and 20xx series.

Post-Ampere GPUs, as the term is used here, are those from the Ampere generation or later, including Hopper. This includes the A10, A6000, A100, H100, L4 and L40S GPUs. It also includes any GPU from the 30xx and 40xx series.
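The support table can also be expressed as a small lookup. The sketch below is an illustration rather than anything Takeoff ships: it assumes you identify the GPU by its CUDA compute capability, where Ampere starts at 8.0 and Turing and Volta sit at 7.5 and 7.0, and it passes None for a CPU. The function name takeoff_support is hypothetical.

```python
def takeoff_support(compute_capability=None, disable_static=False):
    """Return (base_model_ok, awq_model_ok) following the table above.

    compute_capability: (major, minor) tuple for a GPU, or None for a CPU.
    disable_static: True when TAKEOFF_DISABLE_STATIC=1 is set.
    Hypothetical helper for illustration; not part of Takeoff.
    """
    ampere_or_later = (
        compute_capability is not None and tuple(compute_capability) >= (8, 0)
    )
    if ampere_or_later:
        # The table lists both as supported and gives no flag condition
        # for this row, so the flag is not consulted here.
        return True, True
    # Pre-Ampere GPUs and CPUs: base models need the flag, AWQ is unavailable.
    return disable_static, False


# A T4 (compute capability 7.5) with and without the flag, then an A100 (8.0).
print(takeoff_support((7, 5)))
print(takeoff_support((7, 5), disable_static=True))
print(takeoff_support((8, 0)))
```

On a machine with PyTorch installed, torch.cuda.get_device_capability() returns a (major, minor) tuple that can be passed straight in.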