Supported models
Hugging Face models
Takeoff supports most generation and embedding models that are natively supported by Hugging Face Transformers, which includes most models available on the Hugging Face Hub.
Models from the Llama-2, Mistral or Mixtral families benefit from further optimisations on top of the base Takeoff optimisations.
Multi-GPU support is also available for models from these families. Enable it by specifying the number of devices to use with the tensor_parallel variable.
Models quantized using AWQ are supported, and AWQ is the recommended method for running large models on smaller hardware. Read more about AWQ and quantization here. Suitable AWQ models are available here.
Using your own models
How can I use a model I have saved locally?
If you have already fine-tuned a model, you might want to run it in the Takeoff server instead of a Hugging Face model. There are two ways to do this:
- Volume mounting
- Upload to Hugging Face
Save the model locally and volume mount it.
Example:
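Let's say we have trained a model and saved it locally, for example using this Python code:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the base model and its tokenizer ('...' is left as a placeholder)
model = AutoModelForCausalLM.from_pretrained('...')
tokenizer = AutoTokenizer.from_pretrained('...')
# Your training code ...
# Save the tokenizer and the model weights into the same local directory
tokenizer.save_pretrained('my_model')
model.save_pretrained('my_model')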
Then, when running the Takeoff server from the command line, mount the model directory into Takeoff's internal /code/models/jf folder:
docker run --gpus all \
-v /path/to/<my_model>:/code/models/jf/<my_model> \
-e TAKEOFF_MODEL_NAME=my_model \
-e TAKEOFF_DEVICE=cuda \
tytn/takeoff-pro:0.11.0-gpu
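If you launch Takeoff from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of Takeoff itself: the helper name build_takeoff_command is hypothetical, and it assumes the mounted folder name should match TAKEOFF_MODEL_NAME, as in the example above. It resolves the model directory to an absolute path, because Docker treats a relative name passed to -v as a named volume rather than a host directory.

```python
import shlex
from pathlib import Path


def build_takeoff_command(model_dir, image="tytn/takeoff-pro:0.11.0-gpu"):
    """Build the docker run argument list for a locally saved model.

    Hypothetical helper for illustration; it mirrors the volume-mount
    command shown above and does not start the container itself.
    """
    # Docker bind mounts need an absolute host path.
    host_path = Path(model_dir).resolve()
    # Assumption: the folder name doubles as the Takeoff model name.
    model_name = host_path.name
    return [
        "docker", "run", "--gpus", "all",
        "-v", f"{host_path}:/code/models/jf/{model_name}",
        "-e", f"TAKEOFF_MODEL_NAME={model_name}",
        "-e", "TAKEOFF_DEVICE=cuda",
        image,
    ]


cmd = build_takeoff_command("my_model")
# Print a copy-pasteable shell command for inspection.
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once you are happy with the printed command.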
Upload the model to a private Hugging Face Hub repository, and pass in your access token so that Takeoff can download the model.
Example:
docker run --gpus all \
-e TAKEOFF_MODEL_NAME=<My-HF-Account/My-Model> \
-e TAKEOFF_ACCESS_TOKEN=<My-HF-Token> \
-e TAKEOFF_DEVICE=cuda \
tytn/takeoff-pro:0.11.0-gpu
Choosing the right model
Selecting the right model means optimising performance within your hardware constraints. Models are often released in different sizes and can be quantized to different levels, each of which affects performance and memory usage. We discuss balancing these factors in more detail here.
To help you avoid Out of Memory errors, we have also created a memory calculator that will estimate the amount of memory a model will use. This can be accessed from the Takeoff inference GUI. You can also specify your hardware's specifications to determine if a specific model can be run on your configuration. See more about using the calculator here.
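For a quick sanity check before opening the calculator, the memory needed for the weights alone can be approximated as the number of parameters multiplied by the bytes per parameter. The sketch below is a back-of-envelope estimate only, not the method used by the Takeoff memory calculator. The function names and the 20% overhead allowance for activations and cache are illustrative assumptions.

```python
def estimate_weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 1024**3


def fits_on_device(n_params, bits_per_param, device_memory_gib,
                   overhead_fraction=0.2):
    """Rough check of whether a model fits in the given device memory.

    overhead_fraction is an assumed allowance for activations and cache;
    real usage depends on batch size and sequence length.
    """
    needed = estimate_weight_memory_gib(n_params, bits_per_param)
    return needed * (1 + overhead_fraction) <= device_memory_gib


# A 7-billion-parameter model in 16-bit precision vs. 4-bit quantization.
print(f"{estimate_weight_memory_gib(7e9, 16):.2f} GiB")  # 13.04 GiB
print(f"{estimate_weight_memory_gib(7e9, 4):.2f} GiB")   # 3.26 GiB
```

Under these assumptions, a 13-billion-parameter model at 16 bits would not fit on a 24 GiB device, while the same model quantized to 4 bits would. Treat the result as a first filter and use the calculator for an actual estimate.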