Supported Models & Hardware

Supported Models

Takeoff supports most generation & embedding models natively supported by HuggingFace Transformers, which includes most models available on the HuggingFace Hub.

Models from the Llama-2, Mistral or Mixtral families benefit from further optimisations to the base Takeoff optimisations.

Multi-GPU support is also available for models from these families, enabled by specifying the devices to use with the TAKEOFF_CUDA_VISIBLE_DEVICES variable.

Models which are quantized using AWQ are supported, with AWQ being the recommended method with which to run large models on smaller hardware. Read more about AWQ and Quantization here.

Selecting the right model requires optimising performance under your hardware constraints. Models are often issued in different sizes, and can be quantized to different levels, each affecting the performance and memory usage. We discuss balancing these factors in more details here.

To help you avoid Out of Memory errors, we have also created a memory calculator that will estimate the amount of memory a model will use. This can be accessed from the Takeoff inference GUI. You can also specify your hardware's specifications to determine if a specific model can be run on your configuration. See more about using the calculator here.

Supported Models Table

Here's a list of all the supported models categorized by size that are readily available on our Hugging Face Hub. These models are divided into three categories based on their parameter size: Small Models (less than 5 billion parameters), Medium Models (between 5 billion and 13 billion parameters), and Large Models (more than 13 billion parameters).

Our Hugging Face Hub hosts various pre-trained models optimized for different use cases, from natural language processing tasks like text generation, question answering, and embeddings, to advanced multi-modal models for specific domains. This allows developers, researchers, and machine learning engineers to deploy these models directly or fine-tune them according to their needs, saving time and computational resources.

We strive to keep these models updated and continuously add new versions with optimized performance and smaller memory footprints (like quantized models). If you're looking for a model that fits your project’s specific requirements, explore the comprehensive list below:

Model Size	Generative	VLM	Model Size	Embedding	Reranker
Small (< 5B)	TitanML/Qwen2-1.5B	TitanML/Qwen2-VL-1.5B	Small (< 100m)	jinaai/jina-embeddings-v2-small-en	jinaai/jina-reranker-v1-tiny-en
	TitanML/Qwen2-Math-1.5B	TitanML/InternVL-1B		sentence-transformers/all-MiniLM-L6-v2	jinaai/jina-reranker-v1-turbo-en
	TitanML/gemma-2-2b	google/paligemma-3b-pt-896		mixedbread-ai/mxbai-embed-large-v1
	Qwen/Qwen2-0.5B	Qwen/Qwen2-VL-2B-Instruct
		OpenGVLab/InternVL2-4B
Medium (5B ~ 13B)	TitanML/Meta-Llama-3.1-8B	llava-hf/llava-v1.6-mistral-7b-hf	Medium(100m ~ 300m)	TitanML/jina-v2-base-en-embed	BAAI/bge-reranker-base
	TitanML/Qwen2-7B	llava-hf/llama3-llava-next-8b-hf		jinaai/jina-embeddings-v2-base-en	ibm/re2g-reranker-nq
	TitanML/Mistral-7B-Instruct-v0.3-AWQ-4bit	llava-hf/llava-1.5-7b-hf		intfloat/multilingual-e5-small
	mistralai/Mistral-7B-v0.3	OpenGVLab/InternVL2-8B		nomic-ai/nomic-embed-text-v1.5
Large (> 13B)	TitanML/Meta-Llama-3.1-70B-Instruct	TitanML/InternVL2-Llama3-76B-AWQ	Large(> 300m)	intfloat/multilingual-e5-large-instruct	BAAI/bge-reranker-v2-m3
	meta-llama/Meta-Llama-3.1-405B-Instruct	Qwen/Qwen2-VL-72B-Instruct		BAAI/bge-large-en-v1.5	BAAI/bge-reranker-v2-gemma
	TitanML/Qwen-72B-Chat	llava-hf/llava-1.5-13b-hf		BAAI/bge-reranker-v2.5-gemma2-lightweight	openbmb/MiniCPM-Reranker
	Qwen/Qwen2-57B-A14B	OpenGVLab/InternVL2-26B
		OpenGVLab/InternVL2-40B

Supported Hardware

Takeoff is designed to work across as wide a range of hardware as possible, to lower the barrier to start working with LLMs. However, to maximize performance on commonly used hardware, hardware-specific optimizations are sometimes used, which means not all hardware types can support all model optimizations. The biggest difference is between Ampere (and later) generation GPUs vs pre-Ampere generation GPUs and CPUs.

Post-Ampere specific optimizations are used for most commonly used model types and for all AWQ models. This means that on CPUs and pre-Ampere GPUs, you cannot use models like Llama, Mistral, or Mixtral out of the box, or any AWQ model.

Hardware Type	Can Use Base Model?	Can Use AWQ Model?
Post-Ampere GPUs	✔️	✔️
Pre-Ampere GPUs	✔️	❌
CPUs	✔️	❌

Pre-Ampere GPUs are those from the Turing and Volta generations or earlier. This includes the T4, V100, and Quadro RTX 8000 GPUs. It also includes any GPU from the 10xx and 20xx series.

Post-Ampere GPUs are those from the Ampere or Hopper generation of GPUs or later. This includes the A10, A6000, A100, H100, L4, and L40S GPUs. It also includes any GPU from the 30xx and 40xx series.