Version: 0.16.x

Frequently Asked Questions

Why am I running out of memory?

Out-of-memory (OOM) errors are a hard reality of running Large Language Models on finite GPU resources. There are mitigations available, but the appropriate course of action depends on why you're experiencing the error. The first thing to establish is whether the model you are trying to load has a larger memory footprint than the GPU memory you have made available to Takeoff.

Your Model Footprint is too big for your provided GPU

Whether you are importing your model from local storage or using an open-source one from HuggingFace, the model weights need to be transferred from your machine's CPU memory to GPU memory. This ensures fast access to the model weights at inference time.

These errors can seem final, and sometimes you have to concede that your model selection and hardware are not compatible. However, there are some workarounds you can try first.

Provide more resources

Takeoff supports tensor parallelism, so if you re-run Takeoff on a machine with multiple GPUs you can split the model across them. To do so you can specify multiple devices with CUDA_VISIBLE_DEVICES.

Quantization

Quantization is the process of reducing the size of the numbers that represent the weights in your model layers. For example, the popular llama-2-13b-chat at float32 precision (more info on data types) has a model footprint of ~52GB. This is far too large to fit on most GPUs. With a quantized version, llama-2-13b-chat-awq at int4 precision, the model footprint shrinks to ~6.5GB, which fits on the vast majority of GPUs in the ecosystem!
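As a rough check on the figures above, the weight footprint can be estimated as parameter count times bytes per weight. The sketch below is a minimal illustration, not part of Takeoff: model_footprint_gb is a hypothetical helper name, and it counts weights only, so real footprints will be somewhat higher once quantization metadata and runtime overhead are included.

```shell
# Rough weight footprint in GB: parameters (in billions) x bits per weight / 8.
# Weights only; the KV Cache and runtime overhead are not included.
model_footprint_gb() {
    # $1 = parameter count in billions, $2 = bits per weight
    awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

model_footprint_gb 13 32   # float32 -> 52.0
model_footprint_gb 13 16   # float16 -> 26.0
model_footprint_gb 13 4    # int4    -> 6.5
```

Comparing the result against the memory of the GPU you plan to use gives a quick first answer to whether the weights alone will fit.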

To use a quantized model just pass its name as the TAKEOFF_MODEL_NAME environment variable and Takeoff will automatically detect the weight precision and load the model correctly.

A range of other AWQ models is available here. Please get in touch if you want more help with quantization.

Your Model Footprint is less than available GPU memory but still hitting OOMs

If you've checked the size of the model you are trying to run, and you're sure the resources available to Takeoff are adequate, then the OOMs you are hitting are probably a result of the KV Cache growing beyond your memory capacity. Put simply, the KV Cache stores the key and value tensors that the attention layers compute for earlier tokens, so they need not be recomputed each time the model generates a new token. It can be statically or dynamically allocated, depending on the model you are trying to run.

By default, Takeoff will try to run in Static mode, which gives you faster memory access and protection from OOMs at inference time. If you are running a model with a popular architecture such as Llama, Takeoff will run in Static mode. In this case, an OOM can only occur during model loading: after the weights have been transferred to your GPU, Takeoff requests a memory allocation, and that request can exceed your hardware's capacity.

Two parameters control the size of your KV Cache: TAKEOFF_MAX_SEQUENCE_LENGTH and TAKEOFF_MAX_BATCH_SIZE. The former is the maximum length, in tokens, that a single prompt plus its generated tokens can reach, and the latter is the number of sequences the model can process concurrently. The KV Cache footprint is directly proportional to both parameters, so reducing them shrinks the cache and helps prevent OOMs. Unfortunately, doing so will also reduce Takeoff's throughput and generation length, so if that is a deal breaker you may want to investigate:

Using a quantized version of your model.
Providing more resources to your Takeoff instance.
Stopping high workloads from hitting Takeoff and overloading your GPU. Put a reverse proxy with rate limiting in front of your Takeoff instance to cap the number of requests sent to Takeoff within a time window.
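To get a feel for how TAKEOFF_MAX_SEQUENCE_LENGTH and TAKEOFF_MAX_BATCH_SIZE drive the KV Cache size, the sketch below uses the generic estimate for standard multi-head attention: 2 (keys and values) x layers x hidden size x bytes per value x sequence length x batch size. This is an approximation under stated assumptions, not Takeoff's actual allocation logic. The helper name kv_cache_gib is hypothetical, and the example assumes the Llama-2-13B shape (40 layers, hidden size 5120) with a float16 cache.

```shell
# Generic KV Cache estimate in GiB (an approximation, not Takeoff's allocator):
#   2 (keys and values) x layers x hidden size x bytes per value
#     x max sequence length x max batch size
kv_cache_gib() {
    # $1 = layers, $2 = hidden size, $3 = bytes per cached value,
    # $4 = max sequence length (tokens), $5 = max batch size (sequences)
    awk -v l="$1" -v h="$2" -v b="$3" -v s="$4" -v n="$5" \
        'BEGIN { printf "%.2f\n", 2 * l * h * b * s * n / (1024 * 1024 * 1024) }'
}

# Assumed Llama-2-13B shape: 40 layers, hidden size 5120, float16 (2 bytes) cache.
kv_cache_gib 40 5120 2 4096 8   # -> 25.00
kv_cache_gib 40 5120 2 2048 4   # -> 6.25
```

Because the estimate is linear in both parameters, halving the sequence length and the batch size together cuts the cache to a quarter of its original size.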

If your model is not running with a statically allocated KV Cache you can hit OOMs at runtime. Our suggestions would be:

Follow the steps above: use quantization, provide more resources, or add rate limiting.
Use a Static model to prevent these OOMs under high load:
- Look for an alternative model that has Static support and fulfills your requirements.
- Submit a request to your account manager (or contact us at hello@titanml.co) for us to add Static support for your particular model.

Is My GPU Supported?

We aim to support the entire ecosystem of GPUs currently available. With such a large range on the market, there may be corner cases of atypical hardware we have not encountered. If you are experiencing problems with your hardware's compatibility with Takeoff, reach out to your account manager and we will be able to help. See our supported models for more details about how to get your model running on your hardware.

My Model Is Not As Fast As I Expected

Takeoff Inference Server uses our custom engine to run models. It combines cutting-edge techniques such as kernel fusion, quantization support, tensor parallelism, selective CUDA graph usage, continuous batching, and speculative decoding. We have built it to run the most popular models (Llama, Mixtral, etc.) as fast as possible. One of the great advantages of building our own engine is that we can make dynamic adjustments or offer recommendations to get your models running as fast as possible. If you are experiencing performance issues, please get in touch with your account manager and we can help you out.

I have a problem deploying Takeoff, what should I do?

We work hard to make sure Takeoff can be deployed as seamlessly as possible, but if you do experience friction when initially deploying, we are always here to help. To give your account manager as much context as possible, you can run the container in debug mode. This will spawn Takeoff with lower-level logging, run some tests, and then create a report. Sending this report alongside a description of your problem will speed up the debugging process.

You can run the container in debug mode by running the following command:

Debugging Takeoff
# Add the other parameters you normally pass to the container before the image name.
# The volume mount is needed to retrieve the debug report.
# The trailing "debug" argument runs the container in debug mode.
docker run --gpus all \
    -v ~/.takeoff_cache:/code/models \
    tytn/takeoff-pro:0.16.0-gpu \
    debug

A takeoff_debug/ directory will be created in the mounted model cache (~/.takeoff_cache in the example above). It contains the following files:

takeoff_debug/debug_run.log - Diagnostic information about the hardware you are running Takeoff on, such as CUDA drivers/versions, and the responses from some basic tests.
takeoff_debug/takeoff_rust.log - The log file containing the logs from the inference server.
takeoff_debug/takeoff_python.log - The log file containing the logs from the model.

Now send the takeoff_debug/ folder alongside your issue to your account manager, who will get you back on track as fast as possible.
