Version: 0.12.x

Multi-GPU Deployment

Generation only

Multi-GPU deployment is only available for generative models. Because embedding models are typically small, they rarely need to be split across multiple GPUs.

Multi-GPU deployments support inference of larger models by distributing LLM workloads across multiple GPUs. In practice, this also allows larger batch sizes.

This feature uses Tensor Parallelism to split inference workloads evenly across GPUs, so that multiple processes run in parallel and your model's inference speed increases. To run in a multi-GPU environment, set TAKEOFF_TENSOR_PARALLEL={N} to run across N GPUs. You can optionally choose the devices to run on with CUDA_VISIBLE_DEVICES. This can be a subset of the available GPUs, and Takeoff will distribute the model only across the devices listed. The number of GPUs should be a power of 2 (1, 2, 4, 8, ...), because layers need to be split evenly.
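The power-of-2 constraint and the device subset can be checked before launching. The following is a minimal sketch, not part of Takeoff: the helper names `is_power_of_two` and `count_devices` are hypothetical, and it assumes you want TAKEOFF_TENSOR_PARALLEL to equal the number of devices listed in CUDA_VISIBLE_DEVICES.

```shell
#!/usr/bin/env bash
# Hypothetical pre-launch check; these helpers are illustrative and not shipped with Takeoff.

# Succeeds (exit status 0) when $1 is a positive power of two: 1, 2, 4, 8, ...
is_power_of_two() {
  local n="$1"
  [[ "$n" =~ ^[0-9]+$ ]] || return 1
  # A power of two has exactly one bit set, so n & (n - 1) is zero.
  (( n > 0 && (n & (n - 1)) == 0 ))
}

# Prints the number of comma-separated entries in a CUDA_VISIBLE_DEVICES-style string.
count_devices() {
  local list="$1"
  if [[ -z "$list" ]]; then
    echo 0
    return
  fi
  local IFS=','
  local -a ids
  read -r -a ids <<< "$list"
  echo "${#ids[@]}"
}

# Example: use GPUs 0 and 1 on a machine that may have more.
CUDA_VISIBLE_DEVICES="0,1"
# Assumption: the tensor-parallel degree should match the number of listed devices.
TAKEOFF_TENSOR_PARALLEL="$(count_devices "$CUDA_VISIBLE_DEVICES")"

if is_power_of_two "$TAKEOFF_TENSOR_PARALLEL"; then
  echo "OK: TAKEOFF_TENSOR_PARALLEL=$TAKEOFF_TENSOR_PARALLEL CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
else
  echo "Refusing: $TAKEOFF_TENSOR_PARALLEL GPUs is not a power of two" >&2
fi
```

If the check passes, the two variables can be passed to the container in the same way as the other settings, using `-e`.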

Running in a multi-GPU environment follows the same steps described in launching with docker. Pay special attention to the shm-size setting: by default, Docker gives a container only 64 MB of shared memory, which is critically restrictive for the buffers that parallel processes share. We recommend setting it to 2gb for unimpeded operation.

docker run --gpus all \
-e TAKEOFF_MODEL_NAME=meta-llama/Llama-2-13b \
--shm-size=2gb \
-p 3000:3000 \
-p 3001:3001 \
-v ~/.takeoff_cache:/code/models \