CHANGELOG

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.12.0

  • Added an OpenAI-compatible interface layer (a usage sketch follows this list).
  • Enabled Spacelike Speculative Decoding for non-static models; it uses an in-memory cache for higher generation performance.
  • Added support for LLaVA image-to-text models.
  • Added support for Google's Gemma model series.
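
As an illustration of the OpenAI-compatible layer, the minimal sketch below points the official openai Python SDK (v1+) at a locally running Takeoff server. The base URL, port, and model name are assumptions for this example only; substitute the values for your deployment.

    # Minimal sketch: talking to Takeoff's OpenAI-compatible layer with the
    # openai Python SDK. The base_url, api_key handling and model name are
    # assumptions for illustration.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:3000/v1",  # assumed address of the Takeoff server
        api_key="not-needed",                 # a local server typically ignores the key
    )

    response = client.chat.completions.create(
        model="my-reader",  # hypothetical reader/model name
        messages=[{"role": "user", "content": "Give me one sentence about llamas."}],
    )
    print(response.choices[0].message.content)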

0.11.1

  • Fixed a synchronization bug that could cause a timeout when the server was left inactive for long periods.

0.11.0

  • Added support for reranking & classification models.
  • Added CUDA graph LRU caching to cap memory overheads when using CUDA graphs.
  • Reduced the size of the GPU image by more than half.
  • Fixed a bug where the Vertex integration couldn't find the CUDA driver.
  • Fixed a bug where synchronization issues could arise when using multiple GPUs.

0.10.0

  • Introduced a new custom Takeoff inference engine, which standardizes backend processes and offers an enhanced interface for generation models.
  • With the unified backend, continuous batching now works for all generation models.
  • Implemented GPU/CPU utilization tracking metrics.
  • Released takeoff_client, a Python client package on PyPI for server interaction (a usage sketch follows this list).
  • Removed the option to select backends from the management frontend.
  • Overhauled all documentation and added an API References section.
  • Added support for Mixtral.
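
The snippet below is a minimal sketch of interacting with the server through takeoff_client. The class name, constructor arguments, and generate method are assumptions about the package's interface; check the package documentation on PyPI for the exact API.

    # Minimal sketch, assuming takeoff_client exposes a TakeoffClient class with
    # a generate() method; verify against the package documentation.
    from takeoff_client import TakeoffClient

    client = TakeoffClient(base_url="http://localhost", port=3000)  # assumed defaults
    print(client.generate("What is continuous batching?"))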

0.9.2

  • Bugfix to ensure that GPU VRAM is always cleaned up after a model is dynamically deleted.

0.9.1

  • Added a /config/:reader_id endpoint to the Takeoff Management API that returns the config.json file of the model the reader is currently running (see the sketch below).
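
For illustration, a request to the new endpoint might look like the following; the management API address and port, and the reader id, are placeholders rather than documented values.

    # Minimal sketch: fetch the running model's config.json for a given reader.
    # The management API address/port and the reader id are placeholders.
    import requests

    reader_id = "reader_1"  # hypothetical reader id
    resp = requests.get(f"http://localhost:3001/config/{reader_id}", timeout=10)
    resp.raise_for_status()
    print(resp.json())  # contents of the model's config.json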

0.9.0

  • Added the ability to configure Takeoff with a config.yaml file, which should be mounted at code/config.yaml inside the Takeoff container. This lets you specify multiple readers and the server config declaratively when starting the container. Environment variables can still be used to override individual settings; more details here. A configuration sketch follows this list.
  • The compress-fast backend now supports splitting across multiple GPUs.
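
To illustrate the declarative configuration idea, the sketch below builds a hypothetical config and prints the YAML that could be mounted at code/config.yaml. The key names (takeoff, server_config, readers_config, model_name, device) and the model identifiers are assumptions for illustration only; consult the configuration documentation for the real schema.

    # Minimal sketch of a declarative config: server settings plus two readers.
    # All key names and values here are illustrative assumptions, not the
    # documented schema.
    import yaml  # pip install pyyaml

    example_config = {
        "takeoff": {
            "server_config": {"max_batch_size": 8},
            "readers_config": {
                "reader_1": {"model_name": "meta-llama/Llama-2-7b-chat-hf", "device": "cuda"},
                "reader_2": {"model_name": "BAAI/bge-small-en-v1.5", "device": "cpu"},
            },
        }
    }

    # Print what the mounted code/config.yaml could look like.
    print(yaml.safe_dump(example_config, sort_keys=False))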

0.8.0

  • Added continuous batching for the baseline, fast, and compress-fast backends.
  • Added licence validation for Takeoff.
  • Added loading readers to the management frontend.
  • Added the ability to cancel requests.
  • Minor bug fix to speculative decoding.
  • Minor bug fix to the multi-GPU backend.

0.7.1

  • Added a ready flag to the Management API GET /reader_groups endpoint to indicate whether a model has finished loading (a polling sketch follows this list).
  • Redis max memory and the Takeoff single-prompt limit are now configurable via the environment variables TAKEOFF_REDIS_MAX_MEMORY and TAKEOFF_MAX_PROMPT_STRING_BYTES; their defaults are 1GB and 30KB respectively.
  • Removed the ability to send generation requests to an embedding model through the frontend UIs.
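
As a sketch of how the ready flag could be used, the snippet below polls the management API until every reader reports ready. The port and the exact response shape (a mapping of reader groups to lists of readers, each carrying a ready field) are assumptions for illustration.

    # Minimal sketch: poll GET /reader_groups until all readers report ready.
    # The management API address and the response structure are assumptions.
    import time

    import requests

    MANAGEMENT_URL = "http://localhost:3001"  # assumed management API address

    def wait_until_ready(poll_seconds: float = 2.0) -> None:
        while True:
            groups = requests.get(f"{MANAGEMENT_URL}/reader_groups", timeout=10).json()
            readers = [reader for group in groups.values() for reader in group]
            if readers and all(reader.get("ready") for reader in readers):
                return
            time.sleep(poll_seconds)

    wait_until_ready()
    print("All models have finished loading.")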

0.7.0

  • Added a new model memory calculator to the inference frontend. You can check whether your models will fit on your hardware with the desired sequence length and batch size.
  • Switched to hash history for the inference and management apps, fixing the 404 returned when refreshing a sub-page of the app.