Frequently Asked Questions


How does Takeoff differ from vLLM?
  • vLLM is a library for model serving. Orchestrating it yourself (serving, monitoring, failure recovery, scaling, load balancing, and keeping it up to date amid a flurry of dependencies) is all extra infrastructure work, with a high initial cost and a continuous maintenance burden.
  • Takeoff is an inference stack: we orchestrate it for you and give you the tools to easily monitor and scale it. We want to enable you to focus your effort on application development, not on orchestrating the underlying models.
  • Takeoff typically matches or exceeds the performance of vLLM for many common workloads. Both use continuous batching and paged attention to maximise performance, and both optimise for throughput by default.
  • Future performance gains come free with Takeoff: we handle the full-time job of staying on top of the cutting edge and turning it into performance for you. With vLLM, those gains depend on open-source contributions, followed by further analysis and reconfiguration of existing installs.
  • vLLM is built to be as flexible as possible, providing a glut of (irregularly documented) options with no indication of which might be useful or how best to achieve your intended workload.
  • vLLM does not support hosting multiple models in a single instance, a feature typically useful for workloads like RAG.
How does Takeoff differ from Triton Inference Server?
  • Triton Inference Server is NVIDIA's offering for bridging the gap between a model inference backend (like TensorRT-LLM) and a userland API: it provides support for hosting and switching multiple models across your hardware and handles marshalling requests to the relevant model.
  • If vLLM is an alternative to Takeoff's engine, Triton Inference Server can be thought of as an alternative to Takeoff's model management system & batching manager.
  • Triton Inference Server uses (request-level) dynamic batching to improve throughput across multiple requests, whereas Takeoff uses continuous batching. See here for a discussion of the differences, and the sketch after this list for a toy illustration.
  • Triton Inference Server is even more complicated than vLLM to set up, and still doesn't provide the configuration assistance or out-of-the-box scaling support provided by the Takeoff stack.
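To make that batching distinction concrete, here is a toy, scheduling-only sketch. The request sizes, the MAX_BATCH limit, and both scheduler models are illustrative assumptions made for the sake of the example; this is not Takeoff's or Triton's actual scheduler.

```python
# Toy comparison of request-level dynamic batching vs continuous batching.
# Each request is (arrival_step, tokens_to_generate); all numbers are made up.

REQUESTS = [(0, 8), (0, 2), (1, 2), (3, 2)]
MAX_BATCH = 4

def dynamic_batching(requests):
    """Batch whatever has arrived, then run that batch to completion
    before admitting anything that arrived in the meantime."""
    pending = sorted(requests)          # sorted by arrival step
    finish_times = []
    step = 0
    while pending:
        arrived = sum(1 for arrival, _ in pending if arrival <= step)
        if arrived == 0:
            step = pending[0][0]        # idle until the next arrival
            continue
        take = min(arrived, MAX_BATCH)
        batch, pending = pending[:take], pending[take:]
        step += max(tokens for _, tokens in batch)   # batch runs to completion
        finish_times.extend([step] * len(batch))
    return finish_times

def continuous_batching(requests):
    """Re-form the batch at every decoding step: finished sequences leave
    immediately and newly arrived requests join straight away."""
    pending = sorted(requests)
    running = []                        # remaining token counts
    finish_times = []
    step = 0
    while pending or running:
        # Admit newly arrived requests into free batch slots.
        while pending and pending[0][0] <= step and len(running) < MAX_BATCH:
            running.append(pending.pop(0)[1])
        step += 1                       # one decoding step for the whole batch
        running = [t - 1 for t in running]
        finish_times.extend([step] * sum(1 for t in running if t == 0))
        running = [t for t in running if t > 0]
    return finish_times

if __name__ == "__main__":
    print("dynamic batching finish steps:   ", sorted(dynamic_batching(REQUESTS)))
    print("continuous batching finish steps:", sorted(continuous_batching(REQUESTS)))
```

In this toy trace the short, late-arriving requests finish several steps earlier under continuous batching, because they never have to wait for an in-flight batch to drain before being admitted.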
How can I deploy Takeoff with Cloud Service Providers (AWS/GCP/Azure/Snowflake/etc.)?

Takeoff works seamlessly with both your own hardware and cloud service providers. We've provided guides for the providers above here, and we'd be happy to work with you to integrate Takeoff into your own hardware/CSP setup, even if it isn't one of the providers listed in our guides.

Will Takeoff work without an internet connection e.g. on airgapped servers?

Yes - Takeoff is designed as a solution for using language models in industries with tight regulatory environments, where it may be required to run on an airgapped server or behind a very restrictive firewall. Takeoff can work entirely offline/locally, provided the model is available locally and the licence key is supplied as a file (see here for more).

Why is my model running out of memory?

Out-of-memory (OOM) errors are a hard reality of running Large Language Models on finite GPU resources. Takeoff aims to give you as many tools as possible to get your model running, though, and we'd recommend changing the values of these reader parameters in the following order:

  • Page cache size: The amount of device memory required depends principally on the model weights (which are fixed for a given model/quantization level) and the size of the KV cache. As we use paged attention, the size of the KV cache is controlled with the page_cache_size parameter, specified per reader. By default this is 90% of the device memory remaining after the model weights have been loaded, but it can be lowered to a different percentage or set to an absolute amount (e.g. 2GB) if the default isn't working for you. This fixes most out-of-memory issues that aren't caused by the model weights themselves being too large; the sketch after this list puts rough numbers on the budget.

  • Chunked Prefill size: Takeoff breaks down inputs into chunks which are prefilled separately. If you have loaded the model but can't run sequence lengths of 512 without running out of memory, try overriding the value of prefill_chunk_size via the manifest to something less than 512.

  • Run with multiple GPUs: Takeoff supports tensor parallelism, so if you re-run Takeoff on a machine with multiple GPUs you can split the model across them. To do so, specify multiple devices with CUDA_VISIBLE_DEVICES.

  • If more GPU resources aren't available, you could consider using a quantized model. Quantization is the process of reducing the size of the numbers that represent the weights in your model layers. Deploying the popular llama-2-13b-chat at float32 precision (more info on data types) gives a model footprint of ~52GB, which is far too large to fit on most GPUs.

    • Using a quantized version, llama-2-13b-chat-awq at int4 precision, the model footprint shrinks to ~6.5GB, which would fit on the vast majority of GPUs in the ecosystem! AWQ models are supported in Takeoff by providing the model name as usual.
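Putting rough numbers on the suggestions above, here is a minimal back-of-the-envelope sketch. The 13B parameter count, the ~52GB and ~6.5GB footprints, and the 90% page_cache_size default come from the points above; the 24GB GPU, the bytes-per-parameter table, and the helper names are illustrative assumptions, not anything Takeoff exposes.

```python
# Back-of-the-envelope memory planning for the options listed above.
# Everything here is an approximation: real deployments also need headroom
# for activations, the CUDA context and framework overhead.

BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}
GB = 1e9

def weights_footprint_gb(n_params: float, precision: str) -> float:
    """Approximate size of the model weights alone."""
    return n_params * BYTES_PER_PARAM[precision] / GB

def default_page_cache_gb(gpu_memory_gb: float, weights_gb: float,
                          page_cache_fraction: float = 0.9) -> float:
    """KV page cache budget: by default, 90% of what remains after the weights."""
    free_after_weights = gpu_memory_gb - weights_gb
    if free_after_weights <= 0:
        return 0.0   # the weights alone don't fit: quantize or add GPUs
    return page_cache_fraction * free_after_weights

if __name__ == "__main__":
    n_params = 13e9          # llama-2-13b-chat
    gpu_memory_gb = 24.0     # e.g. a single 24GB card (illustrative assumption)

    for precision in ("float32", "int4"):
        weights = weights_footprint_gb(n_params, precision)
        cache = default_page_cache_gb(gpu_memory_gb, weights)
        print(f"{precision:>7}: weights ~{weights:.1f}GB, "
              f"default page cache ~{cache:.1f}GB on a {gpu_memory_gb:.0f}GB GPU")

    # Tensor parallelism splits the weights across devices, so each GPU only
    # needs to hold its share of the model.
    n_gpus = 2
    per_gpu = weights_footprint_gb(n_params, "float16") / n_gpus
    print(f"float16 across {n_gpus} GPUs: ~{per_gpu:.1f}GB of weights per device")
```

If the weights line alone exceeds your device memory, no page_cache_size value will help; quantization or splitting across more GPUs is the lever to reach for, and only then is it worth tuning the page cache and prefill chunk size.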
Is my GPU supported?

We aim to support the entire ecosystem of GPUs that are out there at the moment. With such a large range available, there may be corner cases of exotic hardware we have not encountered. If you are experiencing problems with your hardware's compatibility with Takeoff, reach out to your account manager and we will be able to help. See our supported models for more details about how to get your model running on your hardware.

How can I test out Takeoff?

You currently need access to the Takeoff Docker Repository to test out Takeoff. Reach out to us here for a one-month free trial.