Multi-GPU Deployment
Multi-GPU deployment is only available for generative models. Given their typically small size, embedding models rarely require it.
Multi-GPU deployments support the inference of larger models by distributing LLM workloads across multiple GPUs; in practice, this also allows the use of bigger batch sizes. The feature leverages Tensor Parallelism to split inference workloads evenly across the GPUs, allowing multiple processes to run in parallel and increasing your model's inference speed. To run a multi-GPU environment, specify multiple devices in TAKEOFF_CUDA_VISIBLE_DEVICES and Takeoff will distribute the model across them. This also lets you allocate different devices to different readers.
Running the multi-GPU environment is described in launching with Docker. Special attention should be paid to setting shm-size: by default, Docker allocates parallel processes critically restrictive shared-memory buffers (64MB). We recommend setting this to 2GB for unrestricted operation.
docker run --gpus all \
-e TAKEOFF_CUDA_VISIBLE_DEVICES="0,1" \
-e TAKEOFF_MODEL_NAME=TitanML/llama2-13b-chat-4bit-AWQ \
-e TAKEOFF_ACCESS_TOKEN=<token> \
-e TAKEOFF_DEVICE=cuda \
--shm-size=2gb \
-p 3000:3000 \
-p 3001:3001 \
-v ~/.takeoff_cache:/code/models \
tytn/takeoff-pro:0.19.1-gpu
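Once the container is up, one way to sanity-check that the model has actually been sharded across both devices is to inspect per-GPU memory usage on the host; with tensor parallelism, each visible device should hold a roughly equal share of the model's weights. This is an optional diagnostic, not part of the deployment itself:

```shell
# List memory in use on each GPU. After a 2-way tensor-parallel launch,
# devices 0 and 1 should both show substantial, similar memory usage.
nvidia-smi --query-gpu=index,memory.used --format=csv
```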
TAKEOFF_TENSOR_PARALLEL is deprecated: the tensor-parallel degree is now automatically determined as the number of devices made visible by TAKEOFF_CUDA_VISIBLE_DEVICES. If TAKEOFF_CUDA_VISIBLE_DEVICES has not been specified, all devices will be visible to Takeoff, but the model will only be deployed on a single device (0). TAKEOFF_TENSOR_PARALLEL can still be specified for backwards compatibility, and its value overrides the automatic behaviour.
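As a sketch of the backwards-compatible override described above: setting TAKEOFF_TENSOR_PARALLEL explicitly takes precedence over the device count inferred from TAKEOFF_CUDA_VISIBLE_DEVICES. Exactly which of the visible devices the model lands on in this case is an assumption best verified on your own hardware:

```shell
# Four devices are visible, but the explicit (deprecated) override
# requests a tensor-parallel degree of 2 rather than the inferred 4.
docker run --gpus all \
-e TAKEOFF_CUDA_VISIBLE_DEVICES="0,1,2,3" \
-e TAKEOFF_TENSOR_PARALLEL=2 \
-e TAKEOFF_MODEL_NAME=TitanML/llama2-13b-chat-4bit-AWQ \
-e TAKEOFF_ACCESS_TOKEN=<token> \
-e TAKEOFF_DEVICE=cuda \
--shm-size=2gb \
-p 3000:3000 \
-p 3001:3001 \
-v ~/.takeoff_cache:/code/models \
tytn/takeoff-pro:0.19.1-gpu
```

New deployments should omit TAKEOFF_TENSOR_PARALLEL and control the parallel degree through TAKEOFF_CUDA_VISIBLE_DEVICES alone.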