CHANGELOG
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Fixed speculative decoding for image models.
- Fixed a bug with loading Phi and Mistral models.
- Prefix coalescing, enabled by default: drastically improves throughput on workloads with varying degrees of prefix sharing.
- Support for multiple images in image-to-text workloads.
- Ability to dynamically scale LLM deployments based on usage metrics, including scale to zero.
- Support for the new Transformers tokenizer file format.
- Improved robustness in the connection between the server and its readers.
- Improved stability for image-to-text models to prevent out-of-memory errors.
- Fixed a bug where idle multi-GPU instances could fail.
- Support for outlines as a JSON decoding backend: select it with the `constrained_decoding_backend` generation parameter (see the sketch below).
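A minimal sketch of selecting the outlines backend from an API client. The endpoint path, port, and every payload field other than `constrained_decoding_backend` and `json_schema` are assumptions for illustration, not details given in this changelog.

```python
import requests

# Hypothetical request shape: the /generate path, the port, and the "text"
# field are placeholders; only constrained_decoding_backend and json_schema
# are the parameters named in this changelog.
payload = {
    "text": "Extract the user's name and age as JSON.",
    "json_schema": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
        "required": ["name", "age"],
    },
    "constrained_decoding_backend": "outlines",  # pick the outlines backend
}

response = requests.post("http://localhost:3000/generate", json=payload)
print(response.json())
```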
- Fixed a bug affecting concurrent requests with different generation parameters.
- Fixed a bug with using chat templates.
- Improved experience for Vertex and ANSI-less consoles by adding `TAKEOFF_LOG_DISABLE_TIMESTAMPS` and `NO_COLOR` support.
- Llama 3.2 support.
- Added per-request tokens-per-second logging.
- Bugfixes for distributed Takeoff.
- Removed support for generative models on CPU.
- Added detokenization endpoint support for API readers.
- Added a detokenization endpoint: use Takeoff to turn tokens into text (see the sketch below).
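A hedged sketch of calling the new detokenization endpoint. The `/detokenize` path and the request and response field names are hypothetical; the changelog only states that the endpoint exists.

```python
import requests

# Hypothetical path and payload: treat "/detokenize" and the "tokens" field
# as placeholders for whatever the endpoint actually expects.
resp = requests.post(
    "http://localhost:3000/detokenize",
    json={"tokens": [9906, 1917]},  # example token IDs
)
print(resp.json())  # expected to contain the decoded text
```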
- Enhanced Gemma 2 support.
- Chunked prefilling is now enabled by default.
- Various internal optimizations: you should see increased throughput throughout Takeoff.
- Decreased memory usage for prefix caching.
- Fixed chat templates for distributed Takeoff setups.
- Fixed a bug that could reduce performance for long-context Llama 3.1.
- Fixed some overly verbose logging.
- Speed improvements: Takeoff throughput should be over 3x faster at high load.
- (Beta) Prefix caching: Takeoff throughput and latency will be substantially better for repeated prefixes.
- (Beta) KV cache quantization: fit larger models onto the same machine, or use larger batch sizes.
- UI/UX improvements: specify `TAKEOFF_PAGE_CACHE_SIZE=X%` as a fraction of your GPU's available memory, and Takeoff will use that much (example below).
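A minimal sketch of the expected value format, with `"30%"` as an arbitrary example; in practice, set the variable in whatever environment actually launches the Takeoff server (for example, the container's environment), not in a client process.

```python
import os

# Sketch only: "30%" is an arbitrary example meaning 30% of the GPU's
# available memory. This line just illustrates the X% format; set it in the
# environment that starts the server.
os.environ["TAKEOFF_PAGE_CACHE_SIZE"] = "30%"
```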
- Chat template endpoint.
- Support for Llama 3.1.
- Support for schemaless JSON generation: pass `json_schema: {}` and the model will generate a valid JSON object conforming to no particular schema (example below).
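A short sketch of the schemaless case, reusing the same assumed request shape as the constrained-decoding example above; only the empty `json_schema` is taken from this changelog.

```python
import requests

# Empty schema: the model returns a syntactically valid JSON object with no
# particular structure. Endpoint path and other field names are assumptions.
payload = {
    "text": "Describe the weather in Paris as JSON.",
    "json_schema": {},
}
print(requests.post("http://localhost:3000/generate", json=payload).json())
```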
- New JSON schema features: `oneOf` and `anyOf` are now supported (example below).
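A sketch of a schema exercising the newly supported keywords; pass it as the `json_schema` generation parameter in the same assumed request shape as above.

```python
# "id" may be either a string or an integer (anyOf), while "status" must
# match exactly one of the two alternatives (oneOf).
schema = {
    "type": "object",
    "properties": {
        "id": {"anyOf": [{"type": "string"}, {"type": "integer"}]},
        "status": {"oneOf": [{"const": "active"}, {"const": "archived"}]},
    },
    "required": ["id", "status"],
}
```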
- Bugfixes and performance improvements.