CHANGELOG

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.20.0

  • Support for Outlines as a JSON decoding backend; select it via the constrained_decoding_backend generation parameter (see the sketch below).
  • Fixed a bug affecting concurrent requests that use different generation parameters.
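
A minimal sketch of selecting the new backend on a generation request, assuming a local Takeoff instance on port 3000 with a /generate endpoint that accepts the prompt and generation parameters in its JSON body; the endpoint path and payload keys other than constrained_decoding_backend are assumptions, not taken from this changelog.

```python
import requests

# Hypothetical generation request: constrain the output to JSON and pick the
# Outlines backend via the constrained_decoding_backend parameter (new in 0.20.0).
resp = requests.post(
    "http://localhost:3000/generate",  # assumed default inference port/path
    json={
        "text": "Describe the weather in Paris as JSON.",
        "json_schema": {"type": "object", "properties": {"summary": {"type": "string"}}},
        "constrained_decoding_backend": "outlines",
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```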

0.19.1

  • Fixed a bug when using chat templates.
  • Improved the experience on Vertex and other ANSI-less consoles by adding TAKEOFF_LOG_DISABLE_TIMESTAMPS and NO_COLOR support.

0.19.0

  • Llama 3.2 support.
  • Added per-request tokens-per-second logging.

0.18.0

  • Bugfixes for distributed Takeoff.
  • Removed support for generative models on CPU.
  • Added detokenization endpoint support for API readers.

0.17.0

  • Added a detokenization endpoint: use Takeoff to turn tokens back into text (see the sketch after this list).
  • Enhanced Gemma 2 support.
  • Chunked prefilling is now enabled by default.
  • Various internal optimizations: you should see increased throughput across Takeoff.
  • Decreased memory usage for prefix caching.
  • Fixed chat templates for distributed Takeoff setups.
  • Fixed a bug that could reduce performance for long-context Llama 3.1.
  • Fixed some overly verbose logging.
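
A minimal sketch of round-tripping text through tokenization and the new detokenization endpoint. The /tokenize and /detokenize paths and the response/payload keys shown here are illustrative assumptions; only the existence of the endpoints comes from this changelog.

```python
import requests

BASE = "http://localhost:3000"  # assumed inference API address

# Tokenize some text for a live reader, then turn the tokens back into text.
tokens = requests.post(f"{BASE}/tokenize", json={"text": "Hello, Takeoff!"}, timeout=10).json()
text = requests.post(f"{BASE}/detokenize", json={"tokens": tokens["tokens"]}, timeout=10).json()
print(text)
```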

0.16.0

  • Speed improvements: Takeoff throughput should be >3x faster at high load.
  • (Beta) prefix caching: Takeoff throughput & latency will be substantially better for repeated prefixes.
  • (Beta) KV cache quantization: fit larger models onto the same machine, or use larger batch sizes.
  • UI/UX improvements: specify TAKEOFF_PAGE_CACHE_SIZE=X% as a percentage of your GPU's available memory, and Takeoff will use that much.
  • Chat template endpoint.
  • Support for Llama 3.1.
  • Support for schemaless JSON generation: pass json_schema: {} and the model will generate a valid JSON object, conforming to no particular schema (see the sketch after this list).
  • New JSON schema features: oneOf and anyOf are now supported.
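
A minimal sketch of schemaless JSON generation. The empty json_schema comes from this changelog entry; the endpoint path and the "text" key are the same assumptions as in the 0.20.0 sketch above.

```python
import requests

# Pass an empty json_schema and the model produces a syntactically valid JSON
# object with no particular shape.
resp = requests.post(
    "http://localhost:3000/generate",
    json={"text": "Summarise this changelog entry as JSON.", "json_schema": {}},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```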

0.15.2

  • Bugfixes and performance improvements

0.15.1

  • Bugfixes and performance improvements

0.15.0

  • Distributed Takeoff: distribute a set of Takeoff containers over multiple machines.

0.14.4

  • Snowflake Integration with Takeoff! See our docs for more information.
  • New AWQ kernels with improved performance.
  • Internal throughput optimisations.

0.14.3

  • Internal bugfixes and optimisations relating to: Docker permissions when volume-mounting the model cache, better Python GIL management, and token caching.

0.14.2

0.14.1

  • Support for Llama 3

0.14.0

  • Fully enabled SSD (Spacelike Speculative Decoding) for static models
  • Tokenization endpoint to get tokenized text for any live reader
  • Support for LLaVA 1.6 models
  • Introduced a new AWQ kernel with significantly lower memory overhead.
  • Updated the LangChain integration: unified TitanTakeoff and TitanTakeoffPro, the integrations now use the management API to spin up models, and added text embedding support with TitanTakeoffEmbed.

0.13.2

  • Fixed an issue with multi-GPU inference for models that have a bias in their attention linear layers.

0.13.1

  • Fixed a configuration issue with the entrypoint for Mistral embedding models.
  • Fixed a continuous-batching issue that was causing performance degradation.
  • Added a tokenization endpoint to Takeoff.

0.13.0

  • Support for inline images in image-to-text models. You can now supply an image to the image_generate (and image_generate_stream) endpoint in the form <image:https://url.com/image.jpg> (see the sketch after this list).
  • Debug script for diagnosing issues with Takeoff deployments.
  • Support for Jina's long-context embedding models.
  • Support for Mistral-based embedding models.
  • Support for API-based (OpenAI) model calls from Takeoff.
  • Changes to default memory usage parameters to reduce the likelihood of OOM errors.
  • Fixed a bug where model downloading was not properly atomic, so a failed model download will no longer cause issues for subsequent launches.
  • Fixed a bug where the CPU container was larger than it should have been.
  • Assorted performance improvements and bugfixes.
  • Removed the ability to manually specify the backend that Takeoff uses.
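
A minimal sketch of an inline-image request. The <image:...> prompt form and the image_generate endpoint name come from this changelog; the port, full path, and the "text" payload key are assumptions about the API shape.

```python
import requests

# Embed the image URL directly in the prompt using the <image:...> form.
prompt = "What is in this picture? <image:https://url.com/image.jpg>"
resp = requests.post(
    "http://localhost:3000/image_generate",  # assumed path for the endpoint
    json={"text": prompt},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```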

0.12.0

  • Added an OpenAI-compatible interface layer.
  • Spacelike Speculative Decoding enabled for non-static models. Uses an in-memory cache for higher generation performance.
  • Support for LLaVA image-to-text models.
  • Support for Google's Gemma model series.

0.11.1

  • Fixed a synchronization bug that could cause a timeout when leaving the server inactive for long periods of time.

0.11.0

  • Added support for reranking & classification models.
  • Added CUDA graph LRU caching to cap memory overheads when using CUDA graphs.
  • Reduced the size of the GPU image by more than half.
  • Fixed a bug where the Vertex integration couldn't find the CUDA driver.
  • Fixed a bug where synchronization issues could arise when using multiple GPUs.

0.10.0

  • Introduced a new custom takeoff inference engine, which standardizes backend processes and offers an enhanced interface for generation models.
  • With the unified backend, continuous batching now works for all generation models.
  • Implemented GPU/CPU utilization tracking metrics.
  • Released takeoff_client, a Python client package on PyPI for server interaction (see the sketch after this list).
  • Removed the option to select backends from the management frontend.
  • Overhauled all documentation and added an API References section.
  • Added support for Mixtral.
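
A minimal sketch of using the takeoff_client package. The constructor arguments and the generate() call shown here are assumptions about the client's interface; consult the package's API reference for the exact signatures.

```python
# pip install takeoff_client
from takeoff_client import TakeoffClient

# Connect to a running Takeoff server and send a generation request.
client = TakeoffClient(base_url="http://localhost", port=3000)  # assumed defaults
response = client.generate("What is the capital of France?")
print(response)
```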

0.9.2

  • Bugfix to ensure that GPU VRAM is always cleaned up after a model is dynamically deleted.

0.9.1

  • Added a /config/:reader_id endpoint to the Takeoff Management API to fetch the config.json file of the model that the reader is currently running.
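
A minimal sketch of calling the new endpoint. The management API port (3001) and the reader id are illustrative assumptions about a particular deployment.

```python
import requests

READER_ID = "reader_1"  # illustrative reader id

# Fetch the config.json of the model this reader is currently running.
resp = requests.get(f"http://localhost:3001/config/{READER_ID}", timeout=10)
resp.raise_for_status()
print(resp.json())
```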

0.9.0

  • Takeoff can now be configured with a config.yaml file, which should be mounted at code/config.yaml inside the Takeoff container. This lets you specify multiple readers and the server config declaratively when starting the container (see the sketch below). You can still use environment variables to override individual settings; more details here.
  • The compress-fast backend now supports splitting across multiple GPUs.
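
A minimal sketch of building such a declarative config in Python and writing it out as YAML. All of the key names and model names below are illustrative, not the documented config.yaml schema; only the mount location code/config.yaml comes from this changelog entry.

```python
import yaml  # PyYAML

# Illustrative config: one generative reader and one embedding reader plus
# some server settings, expressed declaratively instead of via env vars.
config = {
    "server_config": {"port": 3000},
    "readers_config": {
        "reader_1": {"model_name": "meta-llama/Llama-2-7b-chat-hf", "device": "cuda"},
        "reader_2": {"model_name": "BAAI/bge-small-en", "device": "cpu"},
    },
}

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
# Mount the resulting file at code/config.yaml inside the Takeoff container.
```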

0.8.0

  • Added continuous batching for the baseline, fast, and compress-fast backends
  • Added licence validation for Takeoff
  • Added loading readers to the management frontend
  • Added the ability to cancel requests
  • Minor bug fix to speculative decoding
  • Minor bug fix to the multi-GPU backend

0.7.1

  • Added a ready flag to the management API GET /reader_groups endpoint to indicate whether a model has finished loading.
  • The Redis max memory and the Takeoff single-prompt limit are now configurable via the environment variables TAKEOFF_REDIS_MAX_MEMORY and TAKEOFF_MAX_PROMPT_STRING_BYTES. Their defaults are 1GB and 30KB respectively.
  • Removed the ability to send generation requests to an embedding model through the frontend UIs.

0.7.0

  • Added a model memory calculator to the inference frontend! You can check whether your models will fit on your hardware at the desired sequence length and batch size.
  • Switched the inference and management apps to hash history, fixing the 404 that appeared when refreshing a sub-page of the app.

0.6.3

  • Inference and Management frontend applications can now be served under paths, e.g. https://takeoff.example.com/inference or https://takeoff.example.com/playground. This is useful for serving frontends when deploying on kubernetes and using an ingress to route traffic to your takeoff pod.
  • SageMaker- and Vertex AI-compatible inference APIs are served on ports 8080 and 3002 respectively, and now have API documentation under /docs.
  • Minor bug fix to Playground UI where no output was displayed.
  • Minor bug fixes to the Takeoff loading process so that it communicates more verbosely with the API frontend. This makes /healthz more robust and makes the API aware of loading readers.

0.6.1

  • Small adjustment to turn down default log verbosity for Takeoff users.

0.6.0

This release adds support for speculative decoding. A small draft model can now be used to decrease latency by drafting a response that the larger model then verifies. This can increase speed 2x without affecting model outputs. It is applied by default whenever a valid student model is available, or it can be controlled with the TAKEOFF_ASSISTANT_NAME environment variable (see the sketch below).
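
A minimal sketch of launching Takeoff with an explicit draft model via the Docker SDK for Python. Only TAKEOFF_ASSISTANT_NAME comes from this release note; the image name, the other environment variable, the model names, and the port mapping are placeholders for your own deployment settings.

```python
import docker  # Docker SDK for Python

client = docker.from_env()
client.containers.run(
    "your-takeoff-image:0.6.0",  # placeholder image name
    environment={
        "TAKEOFF_MODEL_NAME": "meta-llama/Llama-2-13b-chat-hf",   # placeholder main model
        "TAKEOFF_ASSISTANT_NAME": "meta-llama/Llama-2-7b-chat-hf", # draft (student) model
    },
    ports={"3000/tcp": 3000},
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    detach=True,
)
```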

The front end has two new features:

  1. A metrics page showing the statistics of each model's responses
  2. JSON Schema support, to use the controlled generation techniques introduced in 0.5.0

Features

  • Add speculative decoding
  • Add metrics dashboard
  • Expand JSON schema support to the front-end

Fixes

0.5.0

Features

This release focused on tools to integrate RAG functionality within Takeoff. We added support for embedding models with the BERT architecture, giving an easy way to embed thousands of documents quickly. A single GPU can host a BERT model alongside one or more generative models, meaning multiple applications can be powered by a single GPU.

We also introduced controlled generation in the API. You can specify a regex string or a JSON schema in the API, which guarantees that the output will match the schema or regex (see the sketch after the feature list below).

  • Added structured generation: JSON + regex outputs
  • Support multiple readers dynamically
  • Added the "prompt_max_tokens" generation parameter across backends, for truncating prompts to a maximum number of tokens
  • Frontend for model management, plus model selection for the chat and playground UIs
  • Embedding (BERT) model support
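
A minimal sketch of the controlled generation described above. The json_schema and prompt_max_tokens parameter names appear in this changelog; the regex_string name, the endpoint path, and the "text" key are assumptions about the API shape.

```python
import requests

BASE = "http://localhost:3000"  # assumed inference API address

# Constrain the output with a JSON schema.
schema = {"type": "object",
          "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
json_resp = requests.post(
    f"{BASE}/generate",
    json={"text": "Generate a person record.", "json_schema": schema, "prompt_max_tokens": 512},
    timeout=60,
)

# Or constrain the output with a regex (parameter name assumed).
regex_resp = requests.post(
    f"{BASE}/generate",
    json={"text": "Give me a UK postcode.",
          "regex_string": r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}"},
    timeout=60,
)
print(json_resp.json(), regex_resp.json())
```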

0.4.3

Features

Fixes

  • The AWQ backend now accepts safetensors as the model format in the repo

0.4.2

Features

Fixes

  • Fixed OOM errors for the other backends

0.4.1

Features

Fixes

  • Fixed OOM errors for the HF and BNB backends

0.4.0

Features

  • bitsandbytes HF 4-bit backend
  • Takeoff PRO added to Iris
  • Multi-GPU support
  • Mistral support
  • API docs for Takeoff
  • Redis and Python reader are spun up from rust gateway
  • Rust server
  • Rust server serves static files
  • AWQ Backend
  • Batched streaming for AWQ, python reader integrates with Rust gateway
  • Integration and benchmark tests for takeoff
  • Regex guided generation
  • Unify logging formats between rust & python, rationalise log levels
  • Change batching behaviour to fix throughput issues
  • Manager for redis connections in the rust server
  • Conversion entrypoint for AWQ, CT2.
  • Model management API PUT /models to spawn new reader with new config
  • Added bitsandbytes 4bit backend
  • React + Typescript Frontend

Fixes