CHANGELOG
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Fixed a synchronization bug that could cause a timeout when the server was left inactive for a long period of time.
- Added an OpenAI-compatible interface layer.
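  A minimal sketch of calling this layer with the official `openai` Python client. The base URL, port, and model name below are illustrative assumptions, not part of this changelog; check your deployment for the actual values.

  ```python
  # Sketch: point the openai client at Takeoff's OpenAI-compatible layer.
  # base_url, port and model name are assumptions for illustration only.
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:3000/v1",  # hypothetical Takeoff endpoint
      api_key="not-needed",                 # local server; key typically unused
  )

  response = client.chat.completions.create(
      model="primary",                      # hypothetical reader/model name
      messages=[{"role": "user", "content": "Hello, Takeoff!"}],
  )
  print(response.choices[0].message.content)
  ```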
- Spacelike Speculative Decoding enabled for non-static models. Uses an in-memory cache for higher generation performance.
- Support for LLaVA image-to-text models.
- Support for Google's Gemma model series.
- Introduced a new custom Takeoff inference engine, which standardizes backend processes and offers an enhanced interface for generation models.
- With the unified backend in place, continuous batching now works for all generation models.
- Implemented GPU/CPU utilization tracking metrics.
- Released `takeoff_client`, a Python client package on PyPI for server interaction.
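  A minimal usage sketch for `takeoff_client`. The class name, constructor arguments, and `generate` method shown here are assumptions based on typical client design, not confirmed by this changelog; consult the package documentation for the real interface.

  ```python
  # Sketch of using the takeoff_client PyPI package (pip install takeoff_client).
  # Class/method names and ports below are assumptions for illustration.
  from takeoff_client import TakeoffClient

  client = TakeoffClient(base_url="http://localhost", port=3000)  # hypothetical defaults
  result = client.generate("What is the capital of France?")      # hypothetical method
  print(result)
  ```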
- Removed the option to select backends from the management frontend.
- Overhauled all documentation and added an API References section.
- Added support for Mixtral
- Added support for reranking & classification models.
- Added CUDA graph LRU caching to cap memory overheads when using CUDA graphs.
- Reduced the size of the GPU image by more than half.
- Fixed a bug where the Vertex integration couldn't find the CUDA driver.
- Fixed a bug where synchronization issues could arise when using multiple GPUs.
- Bugfix to ensure that GPU VRAM is always cleaned up after a model is dynamically deleted.
- Added a `/config/:reader_id` endpoint to the Takeoff Management API to get the `config.json` file of the model that the reader is currently running.
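  A sketch of querying the new endpoint with `requests`. The management-API host/port and the example reader id are placeholders, not values taken from this changelog.

  ```python
  # Sketch: fetch the config.json of the model a reader is currently running.
  # Host, port and reader id are placeholders for illustration.
  import requests

  reader_id = "reader-0"  # hypothetical reader id
  resp = requests.get(f"http://localhost:3001/config/{reader_id}")  # assumed management port
  resp.raise_for_status()
  print(resp.json())  # the model's config.json contents
  ```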
- Added the ability to configure Takeoff with a `config.yaml` file, which should be mounted at `code/config.yaml` inside the Takeoff container. This lets you specify multiple readers and the server configuration declaratively when starting the container. You can still use environment variables to override individual settings; see the documentation for more details.
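  A sketch of building such a declarative configuration from Python and writing it out for mounting. The key names (`server_config`, `readers_config`, and their fields) are illustrative assumptions only; the real schema is defined in the Takeoff documentation.

  ```python
  # Sketch: write a declarative Takeoff config to config.yaml, to be mounted
  # at code/config.yaml inside the container. The schema shown is an
  # assumption for illustration only.
  import yaml

  config = {
      "server_config": {            # hypothetical server-level settings
          "max_batch_size": 8,
      },
      "readers_config": {           # hypothetical per-reader settings
          "reader1": {
              "model_name": "meta-llama/Llama-2-7b-chat-hf",
              "device": "cuda",
          },
      },
  }

  with open("config.yaml", "w") as f:
      yaml.safe_dump(config, f)
  ```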
- The `compress-fast` backend now supports splitting across multiple GPUs.
- Added continuous batching for the `baseline`, `fast`, and `compress-fast` backends.
- Added licence validation for Takeoff.
- Added loading readers to the management frontend.
- Added the ability to cancel requests.
- Minor bug fix to speculative decoding
- Minor bug fix to the `multigpu` backend.
- Ready flag added to the management API GET `/reader_groups` endpoint to indicate whether a model has finished loading.
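  A sketch of polling this flag before sending traffic. The host/port and the exact response shape (a mapping of groups to readers with a `ready` field) are assumptions for illustration.

  ```python
  # Sketch: wait until every reader reports ready before sending requests.
  # Port and response shape ("ready" field per reader) are assumptions.
  import time
  import requests

  while True:
      groups = requests.get("http://localhost:3001/reader_groups").json()  # assumed port
      readers = [r for group in groups.values() for r in group]            # assumed shape
      if readers and all(r.get("ready") for r in readers):
          print("All models have finished loading.")
          break
      time.sleep(2)
  ```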
- Redis max memory and the Takeoff single-prompt limit are now configurable via the environment variables `TAKEOFF_REDIS_MAX_MEMORY` and `TAKEOFF_MAX_PROMPT_STRING_BYTES`. Their defaults are 1GB and 30KB respectively.
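  A sketch of overriding these variables when starting the container with the `docker` Python SDK. The image name and the override values are placeholders, not recommendations.

  ```python
  # Sketch: raise the Redis memory cap and prompt-size limit at container start.
  # Image name and values are placeholders; defaults apply when the variables
  # are unset (1GB and 30KB respectively).
  import docker

  client = docker.from_env()
  client.containers.run(
      "tytn/takeoff:latest",  # hypothetical image name
      environment={
          "TAKEOFF_REDIS_MAX_MEMORY": "2gb",
          "TAKEOFF_MAX_PROMPT_STRING_BYTES": "60000",
      },
      detach=True,
  )
  ```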
- Removed the ability to send generation requests to embedding models through the frontend UIs.
- New model memory calculator in the inference frontend! You can check whether your models will fit on your hardware with your desired sequence length and batch size.
- Switched to hash history for the inference and management apps, fixing the 404 that occurred when refreshing a sub-page of the app.