Version: 0.21.x

Multi-GPU Deployment

GENERATION ONLY

Multi-GPU deployment is only available for generative models. Embedding models are typically small, so multi-GPU deployment is rarely required for them.

Multi-GPU deployments support the inference of larger models by distributing LLM workloads across multiple GPUs. In practice, this allows the use of bigger batch sizes.

This feature uses tensor parallelism to split inference workloads evenly across GPUs, so that multiple processes run in parallel and your model's inference is faster. To run a multi-GPU deployment, list multiple device IDs in TAKEOFF_CUDA_VISIBLE_DEVICES (for example, "0,1") and Takeoff will distribute the model across the devices provided. This also allows you to allocate different devices to different readers.

Running the multi-GPU environment is described in launching with Docker. Pay special attention to the --shm-size option: by default, Docker gives a container only 64 MB of shared memory, which is critically restrictive for the parallel processes. We recommend setting it to 2 GB for unrestricted operation.

docker run --gpus all \
    -e TAKEOFF_CUDA_VISIBLE_DEVICES="0,1" \
    -e TAKEOFF_MODEL_NAME=TitanML/llama2-13b-chat-4bit-AWQ \
    -e TAKEOFF_ACCESS_TOKEN=<token> \
    -e TAKEOFF_DEVICE=cuda \
    --shm-size=2gb \
    -p 3000:3000 \
    -p 3001:3001 \
    -v ~/.takeoff_cache:/code/models \
    tytn/takeoff-pro:0.21.0-gpu
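
Before launching, it can help to sanity-check the device list that will be passed as TAKEOFF_CUDA_VISIBLE_DEVICES. The following is a minimal sketch, assuming the value is a comma-separated list of integer device indices as in the "0,1" example above; valid_device_list and DEVICES are hypothetical names used for illustration and are not part of Takeoff.

```shell
# Hypothetical helper, not part of Takeoff.
# Assumption: the device list is comma-separated non-negative integers,
# such as "0" or "0,1,3" (the format used in the example above).
valid_device_list() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+(,[0-9]+)*$'
}

DEVICES="0,1"
if valid_device_list "$DEVICES"; then
    echo "device list OK: $DEVICES"
else
    echo "malformed device list: $DEVICES" >&2
fi
```

If the check passes, the same value can be supplied to the container with -e TAKEOFF_CUDA_VISIBLE_DEVICES="$DEVICES".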

TAKEOFF_TENSOR_PARALLEL deprecation

TAKEOFF_TENSOR_PARALLEL is deprecated: the degree of tensor parallelism is now determined automatically as the number of devices made visible by TAKEOFF_CUDA_VISIBLE_DEVICES. If TAKEOFF_CUDA_VISIBLE_DEVICES is not specified, all devices are visible to Takeoff, but the model is deployed on a single device only (device 0). TAKEOFF_TENSOR_PARALLEL can still be specified for backwards compatibility, in which case its value overrides the new behaviour.
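
The precedence described above can be sketched as a small shell function. This is only an illustrative model of the documented rule, not Takeoff's actual implementation; effective_tensor_parallel is a hypothetical name, and its two arguments stand in for the values of TAKEOFF_CUDA_VISIBLE_DEVICES and TAKEOFF_TENSOR_PARALLEL.

```shell
# Illustrative model of the documented rule; not Takeoff's actual code.
# Usage: effective_tensor_parallel "<visible devices>" "<legacy override>"
effective_tensor_parallel() {
    tp_devices="$1"
    tp_override="$2"
    if [ -n "$tp_override" ]; then
        # A set TAKEOFF_TENSOR_PARALLEL takes precedence (backwards compatibility).
        echo "$tp_override"
    elif [ -z "$tp_devices" ]; then
        # No device list given: the model is deployed on a single device (0).
        echo 1
    else
        # Otherwise, count the comma-separated device IDs.
        printf '%s\n' "$tp_devices" | tr ',' '\n' | grep -c .
    fi
}

effective_tensor_parallel "0,1" ""        # prints 2
effective_tensor_parallel "" ""           # prints 1
effective_tensor_parallel "0,1,2,3" "2"   # prints 2
```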