📄️ Model memory management
One of the most important things to consider when running a model is how much memory it will use. This is especially important when running large models on a GPU, where memory is often limited. If you run out of memory, the model will crash and may need to be restarted manually, which can be very frustrating, especially if you have deployed the model to a server and are running it remotely. If you have the luxury of access to multiple GPUs, their memory can, in effect, be combined by taking advantage of multi-GPU deployment.
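As a rough rule of thumb (a back-of-the-envelope estimate, not part of the Takeoff API), the memory needed just to hold a model's weights is the parameter count multiplied by the bytes used per parameter. The sketch below illustrates this calculation; the function name and figures are purely illustrative, and real usage will be higher once activations, the KV cache and framework overhead are included.

```python
def estimate_weight_memory_gib(num_params_billions: float, bytes_per_param: float = 2) -> float:
    """Rough lower bound on GPU memory needed just to hold the weights.

    bytes_per_param: 4 for fp32, 2 for fp16/bf16, 1 for int8, 0.5 for int4.
    Activations, the KV cache and framework overhead add to this figure.
    """
    return num_params_billions * 1e9 * bytes_per_param / 1024**3


# A hypothetical 7B-parameter model in fp16 needs roughly 13 GiB for weights alone,
# which is why splitting it across two smaller GPUs can be attractive.
print(f"{estimate_weight_memory_gib(7, bytes_per_param=2):.1f} GiB")
```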
📄️ Picking Generation Parameters
Choosing the right generation parameters is more of an art than a science. We have some experience with this, so we provide two example parameter sets that have proven useful to us when using generative models.
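By way of illustration only (these use generic LLM sampling parameter names, not necessarily the exact fields Takeoff exposes or the exact sets described on that page), two commonly used styles of configuration look like this:

```python
# Illustrative only: parameter names follow common LLM sampling conventions.
deterministic_params = {
    "temperature": 0.1,   # low temperature -> near-greedy, repeatable answers
    "top_p": 0.9,
    "max_new_tokens": 256,
}

creative_params = {
    "temperature": 0.9,   # higher temperature -> more varied, exploratory text
    "top_p": 0.95,
    "top_k": 50,
    "max_new_tokens": 512,
}
```

The first set suits factual or extractive tasks where repeatability matters; the second suits open-ended writing where variety is desirable.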
📄️ Takeoff Inference Engine
Inference engines are the software used to run large language models and produce output text, and their design strongly influences overall performance. While there are many existing, highly performant engines, we decided to design our own, drawing on our expertise in inference optimisation.
📄️ Batching strategies
An important optimisation for language model serving is batching: combining multiple requests into a single call to the model. This allows the model to process the content of multiple requests in parallel, which, given a GPU's substantial capacity for parallelisation, can significantly increase the throughput of the server.
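Purely as a sketch of the idea (this is not the Takeoff implementation), a minimal dynamic-batching loop gathers requests for a short window, runs them through the model together, and routes each result back to its caller. Here `run_model_on_batch` is a hypothetical stand-in for a batched model call:

```python
import queue
import threading
import time

MAX_BATCH_SIZE = 8       # cap on how many requests go into one model call
MAX_WAIT_SECONDS = 0.01  # how long to wait for more requests to arrive

request_queue = queue.Queue()


def run_model_on_batch(prompts):
    # Placeholder for a real batched model call.
    return [f"output for: {p}" for p in prompts]


def batching_loop():
    while True:
        # Block until at least one request arrives, then gather more
        # until the batch is full or the wait window expires.
        batch = [request_queue.get()]
        deadline = time.monotonic() + MAX_WAIT_SECONDS
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = run_model_on_batch([prompt for prompt, _ in batch])
        for (_, reply_q), output in zip(batch, outputs):
            reply_q.put(output)  # hand each result back to its caller


threading.Thread(target=batching_loop, daemon=True).start()


def generate(prompt):
    reply_q = queue.Queue(maxsize=1)
    request_queue.put((prompt, reply_q))
    return reply_q.get()
```

Concurrent callers of `generate` then share forward passes whenever their requests land within the same wait window.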
📄️ Rust Server
Titan Takeoff uses an optimised server built with Rust, rather than the FastAPI server used in the open source version of Titan Takeoff. In practice we saw a 10x throughput increase with the Rust server.