Batching strategies
An important optimization for language model serving is batching: combining multiple requests into a single call to the model. The model can then process the content of several requests in parallel, which, given GPUs' substantial capacity for parallelisation, can significantly increase the throughput of the server.
Batching in an inference server requires careful thought. One way to do it is at the request level: requests that arrive within a short time window are batched together and processed all at once. This method is also called dynamic batching, and it has some drawbacks.
For one, each request must wait for the batching interval to elapse before it can be processed. This adds unavoidable latency, up to the length of the batching interval, to each request to the server. In practice, this increase in latency is almost always worth it, since the throughput increase on GPUs is so large.
Another issue stems from the fact that the logic is performed at the request level. Consider the following situation. One user submits a request that triggers the LLM to generate a 1000-word document. Nine other users submit simple requests that each trigger the LLM to generate a single word. With a dynamic batch size of 10, if all of the requests arrive in a single window, the nine simple requests have to wait for the 1000-word request to finish before they can be returned to their users. In addition, imagine that another nine simple requests arrive after the first nine have been processed, but before the 1000-word request has finished. In practice, we want these requests to be processed as soon as possible, but with dynamic batching they have to wait for the 1000-word request to finish before they can even start.
These problems can be solved by pushing the scheduling mechanism down to the token level. In this paradigm, the model is configured to process a batch of requests, but the batch can grow and shrink as the model generates tokens. The model continuously works on its 'in-progress batch'. As soon as a request has finished processing, it leaves the in-progress batch and is returned to the user. When the size of the model's in-progress batch drops below a threshold, the model looks in a buffer (in which incoming requests are stored by default) for more requests to add to the in-progress batch, until it is full.
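The difference between the two scheduling levels can be illustrated with a toy simulation. The sketch below is not Takeoff's scheduler: the function names are hypothetical, every request is assumed to arrive at time zero, one word is treated as one token, and the buffer is consulted whenever any slot is free (that is, the refill threshold is assumed to equal the batch size).

```python
from collections import deque


def request_level_finish_steps(lengths, batch_size):
    """Finish step of each request when whole batches run to completion."""
    finish = []
    clock = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        clock += max(batch)  # the batch is held until its longest member ends
        finish.extend([clock] * len(batch))
    return finish


def token_level_finish_steps(lengths, batch_size):
    """Finish step of each request when free slots are refilled after every token."""
    waiting = deque(enumerate(lengths))  # buffer of (request id, tokens to generate)
    in_progress = {}                     # request id -> tokens still to generate
    finish = [0] * len(lengths)
    step = 0
    while waiting or in_progress:
        # Top up the in-progress batch from the buffer until it is full.
        while waiting and len(in_progress) < batch_size:
            req_id, n_tokens = waiting.popleft()
            in_progress[req_id] = n_tokens
        step += 1  # one decoding step yields one token per active request
        for req_id in list(in_progress):
            in_progress[req_id] -= 1
            if in_progress[req_id] == 0:
                finish[req_id] = step  # the request leaves the batch immediately
                del in_progress[req_id]
    return finish


# One 1000-token request followed by 18 single-token requests, batch size 10.
lengths = [1000] + [1] * 18
print(request_level_finish_steps(lengths, 10))
print(token_level_finish_steps(lengths, 10))
```

With request-level scheduling, the first nine short requests are only returned at step 1000 and the next nine at step 1001. With token-level scheduling, they are returned at steps 1 and 2 while the long request keeps generating.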
This paradigm is called continuous batching, and it is the default batching strategy for Takeoff when deploying LLMs. Non-generative models, in which there is no notion of a 'token' (for example, embedding models), can only use dynamic batching. For more information about how to configure both modes in the Takeoff server, see below.
Continuous batching
Continuous batching is a generation algorithm for large language model serving, where the model inference logic is defined such that the batch of requests that the language model is working on can grow and shrink after each generated token. This allows incoming requests to join the processing batch at any time, and allows responses to be returned as soon as they are ready.
This is in contrast to dynamic batching, described in detail below, where the batch size that the model can process is fixed, and incoming requests are buffered in order to allow them to be processed in batches of that size.
Example of Continuous Batching
Without continuous or dynamic batching, hardware can be used inefficiently depending on when each example is submitted ('arrives'). The GPU cannot start on another example until it has finished the previous one, so most examples end up in a backlog. Example 4 below has had to wait for Examples 2 and 3 to finish in their entirety, despite having arrived not long after Example 2.
With continuous batching, examples are added to the batch as they arrive, cyclically filling the available memory space by making use of selective indexing of the KV cache. See more here
Why continuous batching?
The continuous batching process increases the maximum throughput of the server. If you're using Takeoff to process large batch workloads (think summarising a lot of documents, or responding to a long list of questions) or high traffic volumes, continuous batching will allow you to process these requests faster. Each request can join the in-progress batch closer to the time it arrived, and leave the batch as soon as it is finished, rather than waiting for its neighbours in the batch to finish before being returned.
Continuous batching also decreases the server's average time to response, so Takeoff in this configuration will respond faster (on average) than the same server with dynamic batching enabled.
Dynamic batching
Dynamic batching is a process where the size of batches is adjusted to match the stream of incoming requests. During periods of heavy traffic, requests are batched together to maximise throughput, so that a request does not have to wait for every request ahead of it in the queue to finish sequentially.
We achieve this by having a timeout (default 50ms) after which any queued requests are sent to the model. During the timeout interval, incoming requests are held in the queue. When the interval ends, all of the queued requests are sent to the model together to begin processing.
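The timeout mechanism can be sketched as a small collection loop. This is a minimal illustration, not Takeoff's implementation: the function `collect_batch` and its parameters are hypothetical, and it assumes that the window starts when collection begins and that a full batch is dispatched without waiting for the rest of the window.

```python
import queue
import time


def collect_batch(request_queue, max_batch_size, timeout_s=0.05):
    """Gather requests until the timeout elapses or the batch is full."""
    batch = []
    deadline = time.monotonic() + timeout_s  # 0.05 s mirrors the 50ms default
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window over: dispatch whatever has been queued
        try:
            # Block for a request, but never past the end of the window.
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


# Three requests are waiting; the batch is dispatched partially filled.
pending = queue.Queue()
for prompt in ["a", "b", "c"]:
    pending.put(prompt)
print(collect_batch(pending, max_batch_size=8, timeout_s=0.01))
```

A serving loop would call such a function repeatedly and pass each returned batch to the model in a single call.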
Example of Dynamic Batching
Without dynamic batching, hardware can be used inefficiently depending on when each example is submitted ('arrives'). The GPU cannot start on another example until it has finished the previous one, so most examples end up in a backlog. Example 4 below has had to wait for Examples 2 and 3 to finish in their entirety, despite having arrived not long after Example 2. In comparison, the dynamic batching case shown below waits for a full batch to accumulate so as to make maximal use of the GPU's parallelism, which results in the batch of examples being returned earlier. An issue arises if only three examples arrive. Rather than never returning these three examples because the batch hasn't been filled, a timeout is reached which dispatches the three examples on their own.
Were this timeout longer, the users who submitted these three examples would have to wait longer. Were this timeout shorter (say, half of that shown) and the submission pattern the same, the third submission would always be unnecessarily delayed while waiting for the first two submissions to finish. Choosing the timeout is thus important, as discussed below.
How to choose the timeout?
Choosing the right timeout depends a lot on your expected traffic. If you expect requests to come in one by one with reasonable gaps between them, it might make sense to choose a low timeout, such as less than 1ms.
However, if you expect a lot of traffic, it makes sense to optimize for throughput with a longer timeout. Consider increasing the timeout if you are consistently processing batches smaller than the max batch size you specified.
The batch size used is logged when running in debug mode with -e TAKEOFF_LOG_LEVEL=DEBUG. You can see it by searching for a message like the following:
Processing with batch size 32 from a queue of size 50
If the batch size is significantly lower than the max batch size, you may want to consider increasing the timeout to benefit from the throughput increase.
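To make this check less manual, the debug log can be summarised with a short script. The sketch below assumes the log lines contain the message exactly as shown above, possibly with a prefix such as a timestamp; the function name `mean_batch_fill` is hypothetical and not part of Takeoff.

```python
import re

# Pattern built from the sample message above; real lines may carry extra
# prefixes, so the pattern is searched for anywhere in the line.
LOG_PATTERN = re.compile(r"Processing with batch size (\d+) from a queue of size (\d+)")


def mean_batch_fill(log_lines, max_batch_size):
    """Average fraction of the max batch size actually used, or None if no matches."""
    sizes = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match:
            sizes.append(int(match.group(1)))
    if not sizes:
        return None
    return sum(sizes) / (len(sizes) * max_batch_size)


lines = [
    "Processing with batch size 32 from a queue of size 50",
    "Processing with batch size 16 from a queue of size 16",
    "some unrelated log line",
]
print(mean_batch_fill(lines, max_batch_size=32))
```

A value well below 1.0 suggests that batches are being dispatched before they fill, which is the situation in which a longer timeout may help.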
How to choose the max batch size?
Increasing the batch size increases the memory required to run the model. For a given model, given hardware, and given prompt length, there will be a maximum batch size that can be run without erroring. This is difficult to predict ahead of time, so it makes sense to test and decrease the batch size if you get out-of-memory errors.
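The test-and-decrease procedure can be sketched as a simple back-off loop. Everything here is an assumption for illustration: `try_batch` is a hypothetical callable that runs a representative workload at the given batch size, and it is assumed to raise Python's built-in `MemoryError` on failure. In a real deployment the out-of-memory signal would more likely be a framework-specific exception or a failed server start.

```python
def largest_working_batch_size(try_batch, start, min_size=1):
    """Halve the batch size until try_batch succeeds; return None if none does."""
    size = start
    while size >= min_size:
        try:
            try_batch(size)  # hypothetical: runs a test workload at this size
            return size
        except MemoryError:
            size //= 2       # back off and retry with half the batch size
    return None


def fake_try_batch(size):
    """Stand-in workload that pretends anything above 24 runs out of memory."""
    if size > 24:
        raise MemoryError(f"batch size {size} does not fit")


print(largest_working_batch_size(fake_try_batch, start=64))
```

Halving finds a working size quickly but not necessarily the largest one; a finer search between the last failing and first working sizes could follow. The test workload should also use the longest prompts you expect, since the memory required depends on prompt length.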
Configuring batching in Takeoff
See here