CHANGELOG
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Fixed a synchronization bug that could cause a timeout when the server was left inactive for a long period of time.
- Added an OpenAI-compatible interface layer.
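  A minimal sketch of calling this layer with the official `openai` Python client. The base URL, port, and model name below are illustrative assumptions, not part of this changelog; check your deployment for the actual values.

  ```python
  # Sketch: point the openai client at Takeoff's OpenAI-compatible layer.
  # base_url, port and model name are assumptions for illustration only.
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:3000/v1",  # hypothetical Takeoff endpoint
      api_key="not-needed",                 # local server; key typically unused
  )

  response = client.chat.completions.create(
      model="primary",                      # hypothetical reader/model name
      messages=[{"role": "user", "content": "Hello, Takeoff!"}],
  )
  print(response.choices[0].message.content)
  ```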
- Spacelike Speculative Decoding enabled for non-static models. Uses an in-memory cache for higher generation performance.
- Support for LLaVA image-to-text models.
- Support for Google's Gemma model series.
- Introduced a new custom Takeoff inference engine, which standardizes backend processes and offers an enhanced interface for generation models.
- With the unified backend in place, continuous batching now works for all generation models.
- Implemented GPU/CPU utilization tracking metrics.
- Released `takeoff_client`, a Python client package on PyPI for server interaction.
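  A minimal usage sketch for `takeoff_client`. The class name, constructor arguments, and `generate` method shown here are assumptions based on typical client design, not confirmed by this changelog; consult the package documentation for the real interface.

  ```python
  # Sketch of using the takeoff_client PyPI package (pip install takeoff_client).
  # Class/method names and ports below are assumptions for illustration.
  from takeoff_client import TakeoffClient

  client = TakeoffClient(base_url="http://localhost", port=3000)  # hypothetical defaults
  result = client.generate("What is the capital of France?")      # hypothetical method
  print(result)
  ```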
- Removed the option to select backends from the management frontend.
- Overhauled all documentation and added an API References section.
- Added support for Mixtral
- Added support for reranking & classification models.
- Added CUDA graph LRU caching to cap memory overheads when using CUDA graphs.
- Reduced the size of the GPU image by more than half.
- Fixed a bug where the Vertex integration couldn't find the CUDA driver.
- Fixed a bug where synchronization issues could arise when using multiple GPUs.
- Bugfix to ensure that GPU VRAM is always cleaned up after a model is dynamically deleted.
- Added a `/config/:reader_id` endpoint to the Takeoff Management API to get the `config.json` file of the model that the reader is currently running.
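  A sketch of querying the new endpoint with `requests`. The management-API host/port and the example reader id are placeholders, not values taken from this changelog.

  ```python
  # Sketch: fetch the config.json of the model a reader is currently running.
  # Host, port and reader id are placeholders for illustration.
  import requests

  reader_id = "reader-0"  # hypothetical reader id
  resp = requests.get(f"http://localhost:3001/config/{reader_id}")  # assumed management port
  resp.raise_for_status()
  print(resp.json())  # the model's config.json contents
  ```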
- Added the ability to configure Takeoff with a `config.yaml` file, which should be mounted at `code/config.yaml` inside the Takeoff container. This lets you specify multiple readers and the server configuration declaratively when starting the container. You can still use environment variables to override individual settings; see the documentation for more details.
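  A sketch of building such a declarative configuration from Python and writing it out for mounting. The key names (`server_config`, `readers_config`, and their fields) are illustrative assumptions only; the real schema is defined in the Takeoff documentation.

  ```python
  # Sketch: write a declarative Takeoff config to config.yaml, to be mounted
  # at code/config.yaml inside the container. The schema shown is an
  # assumption for illustration only.
  import yaml

  config = {
      "server_config": {            # hypothetical server-level settings
          "max_batch_size": 8,
      },
      "readers_config": {           # hypothetical per-reader settings
          "reader1": {
              "model_name": "meta-llama/Llama-2-7b-chat-hf",
              "device": "cuda",
          },
      },
  }

  with open("config.yaml", "w") as f:
      yaml.safe_dump(config, f)
  ```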
- The `compress-fast` backend now supports splitting across multiple GPUs.
- Added continuous batching for the `baseline`, `fast`, and `compress-fast` backends.
- Added licence validation for Takeoff.
- Added loading readers to the management frontend.
- Added the ability to cancel requests.
- Minor bug fix to speculative decoding
- Minor bug fix to the `multigpu` backend.
- Ready flag added to the management API GET `/reader_groups` endpoint to indicate whether a model has finished loading.
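  A sketch of polling this flag before sending traffic. The host/port and the exact response shape (a mapping of groups to readers with a `ready` field) are assumptions for illustration.

  ```python
  # Sketch: wait until every reader reports ready before sending requests.
  # Port and response shape ("ready" field per reader) are assumptions.
  import time
  import requests

  while True:
      groups = requests.get("http://localhost:3001/reader_groups").json()  # assumed port
      readers = [r for group in groups.values() for r in group]            # assumed shape
      if readers and all(r.get("ready") for r in readers):
          print("All models have finished loading.")
          break
      time.sleep(2)
  ```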
- Redis max memory and the Takeoff single-prompt limit are now configurable via the environment variables `TAKEOFF_REDIS_MAX_MEMORY` and `TAKEOFF_MAX_PROMPT_STRING_BYTES`. Their defaults are 1GB and 30KB respectively.
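  A sketch of overriding these variables when starting the container with the `docker` Python SDK. The image name and the override values are placeholders, not recommendations.

  ```python
  # Sketch: raise the Redis memory cap and prompt-size limit at container start.
  # Image name and values are placeholders; defaults apply when the variables
  # are unset (1GB and 30KB respectively).
  import docker

  client = docker.from_env()
  client.containers.run(
      "tytn/takeoff:latest",  # hypothetical image name
      environment={
          "TAKEOFF_REDIS_MAX_MEMORY": "2gb",
          "TAKEOFF_MAX_PROMPT_STRING_BYTES": "60000",
      },
      detach=True,
  )
  ```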
- Removed the ability to send generation requests to embedding models through the frontend UIs.
- New model memory calculator in the inference frontend! You can check whether your models will fit on your hardware with your desired sequence length and batch size.
- Switched to hash history for the inference and management apps, fixing the 404 that occurred when refreshing a sub-page of the app.