CHANGELOG

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.21.2

  • Fixed Speculative Decoding for image models.

0.21.1

  • Fixed a bug with loading Phi and Mistral models.

0.21.0

  • Prefix Coalescing, enabled by default, for drastically improved throughput on workloads with varying degrees of shared prefixes.
  • Support for multiple images in image-to-text workloads.
  • Ability to dynamically scale LLM deployments based on usage metrics, including scale to zero.
  • Support for the new Transformers tokenizer file format.
  • Improved robustness in the connection between the server and its readers.
  • Improved stability for image-to-text models to prevent out-of-memory errors.
  • Fixed a bug where idle multi-GPU instances could fail.

0.20.0

  • Support for outlines as a JSON decoding backend: set the constrained_decoding_backend generation parameter (see the example after this list).
  • Fixed a bug with concurrent requests with different generation parameters.
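
A minimal sketch of selecting the backend from a Python client. Only the constrained_decoding_backend generation parameter (and the json_schema parameter mentioned under 0.16.0) come from this changelog; the endpoint URL, port, and other request fields are illustrative assumptions.

    import requests

    # Hypothetical generation request: only `constrained_decoding_backend` and
    # `json_schema` come from the changelog; the URL, port, and other fields
    # are illustrative assumptions.
    response = requests.post(
        "http://localhost:3000/generate",  # assumed endpoint and port
        json={
            "text": "Return the user as JSON: name Ada, age 36.",
            "json_schema": {  # schema the decoder is constrained to
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                },
                "required": ["name", "age"],
            },
            "constrained_decoding_backend": "outlines",  # select the outlines backend
        },
        timeout=60,
    )
    print(response.json())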

0.19.1

  • Fixed a bug with using chat templates.
  • Improved experience for Vertex and ANSI-less consoles by adding TAKEOFF_LOG_DISABLE_TIMESTAMPS and NO_COLOR support (see the sketch below).
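
A hedged sketch of passing these variables into a Takeoff container from Python. Only the two variable names come from the entry above; the image name and the use of "1" as an enabling value are assumptions.

    import subprocess

    # Disable timestamps and ANSI colour in Takeoff's logs.
    # Only the variable names are documented above; the image name and the
    # choice of "1" as an enabling value are illustrative assumptions.
    subprocess.run(
        [
            "docker", "run",
            "-e", "TAKEOFF_LOG_DISABLE_TIMESTAMPS=1",  # drop timestamps from log lines
            "-e", "NO_COLOR=1",                        # suppress ANSI colour codes
            "takeoff:latest",                          # assumed image name
        ],
        check=True,
    )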

0.19.0

  • Llama 3.2 support.
  • Added tokens per second per request logging.

0.18.0

  • Bugfixes for distributed Takeoff.
  • Removed support for generative models on CPU.
  • Added detokenization endpoint support for API readers.

0.17.0

  • Added a detokenization endpoint: use Takeoff to turn tokens back into text (see the example after this list).
  • Enhanced Gemma 2 support.
  • Chunked prefilling is now enabled by default.
  • Various internal optimizations: expect increased throughput throughout Takeoff.
  • Decreased memory usage for prefix caching.
  • Fixed chat templates for distributed Takeoff setups.
  • Fixed a bug that could reduce performance for long-context Llama 3.1.
  • Fixed some overly verbose logging.
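
A sketch of calling the detokenization endpoint from Python. The changelog confirms only that such an endpoint exists; the path, port, and payload shape below are assumptions.

    import requests

    # Hypothetical detokenization call: send token ids, get text back.
    # The endpoint path, port, and field names are illustrative assumptions.
    resp = requests.post(
        "http://localhost:3000/detokenize",  # assumed path
        json={"tokens": [101, 7592, 2088, 102]},  # example token ids
        timeout=30,
    )
    print(resp.json())  # expected to contain the decoded text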

0.16.0

  • Speed improvements: Takeoff throughput should be >3x higher at high load.
  • (Beta) Prefix caching: Takeoff throughput and latency are substantially better for repeated prefixes.
  • (Beta) KV cache quantization: fit larger models onto the same machine, or use larger batch sizes.
  • UI/UX improvements: specify TAKEOFF_PAGE_CACHE_SIZE=X% as a fraction of your GPU's available memory, and Takeoff will use that much.
  • Chat template endpoint.
  • Support for Llama 3.1.
  • Support for schemaless JSON generation: pass json_schema: {} and the model will generate a valid JSON object conforming to no particular schema (see the example after this list).
  • New JSON schema features: oneOf and anyOf now supported.
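
A sketch combining two of the JSON features above: schemaless generation via json_schema: {} and an anyOf schema. The json_schema parameter and the anyOf/oneOf keywords come from the entries above; the endpoint, port, and other request fields are assumptions.

    import requests

    GENERATE = "http://localhost:3000/generate"  # assumed endpoint and port

    # Schemaless JSON: an empty schema asks for any valid JSON object.
    schemaless = {"text": "Describe the weather as JSON.", "json_schema": {}}

    # anyOf schema: the output must satisfy at least one of the branches.
    any_of = {
        "text": "Give me an id.",
        "json_schema": {
            "anyOf": [
                {"type": "object", "properties": {"id": {"type": "integer"}}},
                {"type": "object", "properties": {"id": {"type": "string"}}},
            ]
        },
    }

    for payload in (schemaless, any_of):
        print(requests.post(GENERATE, json=payload, timeout=60).json())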

0.15.2

  • Bugfixes and performance improvements.