
The Takeoff Inference Engine

Inference engines are the software used to run large language models and produce output text, and their design strongly influences overall performance. Whilst there are many existing, highly performant engines, we decided to design our own, drawing on our expertise in inference optimisation.

Combining kernel fusion, support for quantization, tensor parallel inference, selective use of CUDA graphs and continuous batching, we've designed the Takeoff Inference Engine to power the Takeoff Inference Server with superlative performance, without restricting the range of models or tasks supported (as many other inference engines do).
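
To make one of these techniques concrete, here is a minimal Python sketch of the idea behind continuous batching: new requests are admitted into the batch between decode steps, and finished sequences free their slots immediately rather than forcing the whole batch to wait. The `Request` and `step` names are illustrative assumptions for this sketch and are not part of the Takeoff API; a real engine schedules GPU kernels over token tensors rather than Python calls.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list[str] = field(default_factory=list)


def step(request: Request) -> str:
    """Stand-in for one decode step of the model (one new token per request)."""
    return f"tok{len(request.generated)}"


def continuous_batching_loop(incoming: deque[Request], max_batch: int = 8) -> None:
    """Run decode steps, admitting new requests whenever a batch slot is free."""
    active: list[Request] = []
    while incoming or active:
        # Fill free batch slots from the queue before every step, not just
        # once per batch -- this is what distinguishes continuous batching
        # from static batching.
        while incoming and len(active) < max_batch:
            active.append(incoming.popleft())

        # One decode step across the whole active batch.
        for req in active:
            req.generated.append(step(req))

        # Retire finished sequences immediately; their slots are reused on
        # the next iteration instead of at the end of the batch.
        for req in [r for r in active if len(r.generated) >= r.max_new_tokens]:
            print(f"finished: {req.prompt!r} -> {len(req.generated)} tokens")
        active = [r for r in active if len(r.generated) < r.max_new_tokens]


if __name__ == "__main__":
    queue = deque(Request(f"prompt {i}", max_new_tokens=2 + i) for i in range(4))
    continuous_batching_loop(queue, max_batch=2)
```

In this sketch, short requests finish and leave the batch early, so throughput is not bounded by the longest sequence in each batch, which is the main benefit continuous batching provides for serving workloads with mixed request lengths.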

We're currently finalizing our white paper on the engine; the engine itself is ready for use as part of version 0.10.0 of Takeoff.