CHANGELOG

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.6.3

  • Inference and Management frontend applications can now be served under paths, e.g. https://takeoff.example.com/inference or https://takeoff.example.com/playground. This is useful for serving frontends when deploying on Kubernetes and using an ingress to route traffic to your Takeoff pod.
  • SageMaker- and Vertex AI-compatible inference APIs are served on ports 8080 and 3002 respectively, and now have API documentation under /docs (see the sketch after this list).
  • Minor bug fix to the Playground UI where no output was displayed.
  • Minor bug fixes to the Takeoff loading process so it communicates more verbosely with the API frontend. This makes /healthz more robust and makes the API aware of readers that are still loading.
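
As a quick check, the new documentation and health endpoints can be exercised over plain HTTP. A minimal sketch using Python's requests library, assuming a local deployment; the host and the main inference port are assumptions:

```python
import requests

BASE = "http://localhost"  # assumed local deployment

# The SageMaker-compatible API listens on port 8080 and the Vertex AI-
# compatible API on port 3002; both now serve docs under /docs.
for port in (8080, 3002):
    resp = requests.get(f"{BASE}:{port}/docs")
    print(port, resp.status_code)

# /healthz now reflects reader loading state more robustly. Port 3000 is
# an assumption for the main inference API.
print(requests.get(f"{BASE}:3000/healthz").status_code)
```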

0.6.1

  • Small adjustment to turn down default log verbosity for Takeoff users.

0.6.0

This release adds support for speculative decoding. A small draft model is now used to decrease latency by drafting a response which the larger model then verifies. This can increase speed 2x without affecting model outputs. It is applied by default whenever a valid draft model is available, and can be controlled with the TAKEOFF_ASSISTANT_NAME environment variable.
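
As a rough illustration of how the draft model might be selected (a minimal sketch: the launch command and model id are hypothetical placeholders, not Takeoff's actual entrypoint or defaults):

```python
import os
import subprocess

# TAKEOFF_ASSISTANT_NAME selects the draft (assistant) model used for
# speculative decoding; the model id below is a placeholder.
env = {**os.environ, "TAKEOFF_ASSISTANT_NAME": "my-org/tiny-draft-model"}

# "takeoff" is a hypothetical launch command; substitute your real one.
subprocess.run(["takeoff"], env=env, check=True)
```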

The front end has two new features:

  1. A metrics page which shows response statistics for each model
  2. JSON Schema support to use the controlled generation techniques introduced in 0.5.0

Features

  • Add speculative decoding
  • Add metrics dashboard
  • Expand JSON schema support to the front-end

0.5.0

Features

This release was focused on tools for integrating RAG functionality within Takeoff. We add support for embedding models with the BERT architecture, giving an easy way to embed thousands of documents quickly. A single GPU can host a BERT model alongside one or more generative models, meaning multiple applications can be powered by a single GPU.
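
For illustration, embedding a batch of documents might look like the following; the /embed route, port, and payload field names are assumptions rather than the documented API:

```python
import requests

documents = ["first document", "second document", "third document"]

# Hypothetical embedding call: the /embed route and "text" field are assumed.
resp = requests.post("http://localhost:3000/embed", json={"text": documents})
resp.raise_for_status()
vectors = resp.json()  # assumed shape: one embedding vector per document
print(len(vectors))
```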

We also introduce controlled generation to the API. You can specify a regex string or a JSON schema in the API, and the output is guaranteed to match the schema or regex.
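
A sketch of what such a request might look like; the endpoint, port, and the regex_string / json_schema parameter names are assumptions, so check the API docs for the exact contract:

```python
import requests

payload = {
    "text": "Return a UK postcode.",
    # Regex constraint; the "regex_string" parameter name is an assumption.
    "regex_string": r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}",
    # A JSON schema could be passed instead, e.g. "json_schema": {...}.
    # "prompt_max_tokens" (listed in the features below) truncates prompts.
    "prompt_max_tokens": 512,
}

resp = requests.post("http://localhost:3000/generate", json=payload)
resp.raise_for_status()
print(resp.json())
```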

  • Add structured generation: JSON + regex outputs
  • Support multiple readers dynamically
  • Add "prompt_max_tokens" generation parameter across backends, for truncating prompts to max number of tokens
  • Frontend for model management, model selection for chat and playground UI
  • Embedding (Bert) model support

0.4.0

Features

  • bitsandbytes HF 4-bit backend
  • Takeoff PRO added to Iris
  • Multi GPU support
  • Mistral support
  • API docs for Takeoff
  • Redis and Python reader are spun up from the Rust gateway
  • Rust server
  • Rust server serves static files
  • AWQ Backend
  • Batched streaming for AWQ; Python reader integrates with the Rust gateway
  • Integration and benchmark tests for Takeoff
  • Regex guided generation
  • Unify logging formats between Rust & Python, rationalise log levels
  • Change batching behaviour to fix throughput issues
  • Manager for Redis connections in the Rust server
  • Conversion entrypoint for AWQ, CT2
  • Model management API PUT /models to spawn a new reader with a new config (see the sketch after this list)
  • React + TypeScript frontend
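
To show how the model management endpoint might be driven (a minimal sketch; the host, port, and config field names are assumptions rather than the documented schema):

```python
import requests

# Hypothetical reader config: these field names are illustrative only.
new_reader = {"model_name": "my-org/my-model", "device": "cuda"}

# PUT /models spawns a new reader with the supplied config. The host and
# port are assumptions for a local deployment.
resp = requests.put("http://localhost:3000/models", json=new_reader)
resp.raise_for_status()
print(resp.json())
```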