CHANGELOG

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.6.3

  • Inference and Management frontend applications can now be served under paths, e.g. https://takeoff.example.com/inference or https://takeoff.example.com/playground. This is useful for serving frontends when deploying on Kubernetes and using an ingress to route traffic to your Takeoff pod.
  • SageMaker- and Vertex AI-compatible inference APIs are served on ports 8080 and 3002 respectively, and now have API documentation under /docs (see the sketch after this list).
  • Minor bug fix to the Playground UI where no output was displayed.
  • Minor bug fixes to the Takeoff loading process so it communicates more verbosely with the API frontend. This makes /healthz more robust and makes the API aware of readers that are still loading.
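
As a quick check, the new documentation and health endpoints can be exercised over plain HTTP. A minimal sketch using Python's requests library, assuming a local deployment; the host and the main inference port are assumptions:

```python
import requests

BASE = "http://localhost"  # assumed local deployment

# The SageMaker-compatible API listens on port 8080 and the Vertex AI-
# compatible API on port 3002; both now serve docs under /docs.
for port in (8080, 3002):
    resp = requests.get(f"{BASE}:{port}/docs")
    print(port, resp.status_code)

# /healthz now reflects reader loading state more robustly. Port 3000 is
# an assumption for the main inference API.
print(requests.get(f"{BASE}:3000/healthz").status_code)
```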

0.6.1

  • Small adjustment to turn down default log verbosity for Takeoff users.

0.6.0

This release adds support for speculative decoding. A small draft model is now used to decrease latency by drafting a response which the larger model then verifies. This can increase speed 2x without affecting model outputs. It is applied by default whenever a valid draft model is available, and can be controlled with the TAKEOFF_ASSISTANT_NAME environment variable.
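
As a rough illustration of how the draft model might be selected (a minimal sketch: the launch command and model id are hypothetical placeholders, not Takeoff's actual entrypoint or defaults):

```python
import os
import subprocess

# TAKEOFF_ASSISTANT_NAME selects the draft (assistant) model used for
# speculative decoding; the model id below is a placeholder.
env = {**os.environ, "TAKEOFF_ASSISTANT_NAME": "my-org/tiny-draft-model"}

# "takeoff" is a hypothetical launch command; substitute your real one.
subprocess.run(["takeoff"], env=env, check=True)
```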

The front end has two new features:

  1. A metrics page which shows response statistics for each model
  2. JSON Schema support to use the controlled generation techniques introduced in 0.5.0

Features

  • Add speculative decoding
  • Add metrics dashboard
  • Expand JSON schema support to the front-end

0.5.0

Features

This release was focused on tools for integrating RAG functionality within Takeoff. We add support for embedding models with the BERT architecture, giving an easy way to embed thousands of documents quickly. A single GPU can host a BERT model alongside one or more generative models, meaning multiple applications can be powered by a single GPU.
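
For illustration, embedding a batch of documents might look like the following; the /embed route, port, and payload field names are assumptions rather than the documented API:

```python
import requests

documents = ["first document", "second document", "third document"]

# Hypothetical embedding call: the /embed route and "text" field are assumed.
resp = requests.post("http://localhost:3000/embed", json={"text": documents})
resp.raise_for_status()
vectors = resp.json()  # assumed shape: one embedding vector per document
print(len(vectors))
```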

We also introduce controlled generation to the API. You can specify a regex string or a JSON schema in the API, and the output is guaranteed to match the schema or regex.
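
A sketch of what such a request might look like; the endpoint, port, and the regex_string / json_schema parameter names are assumptions, so check the API docs for the exact contract:

```python
import requests

payload = {
    "text": "Return a UK postcode.",
    # Regex constraint; the "regex_string" parameter name is an assumption.
    "regex_string": r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][A-Z]{2}",
    # A JSON schema could be passed instead, e.g. "json_schema": {...}.
    # "prompt_max_tokens" (listed in the features below) truncates prompts.
    "prompt_max_tokens": 512,
}

resp = requests.post("http://localhost:3000/generate", json=payload)
resp.raise_for_status()
print(resp.json())
```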

  • Add structured generation: JSON + regex outputs
  • Support multiple readers dynamically
  • Add "prompt_max_tokens" generation parameter across backends, for truncating prompts to max number of tokens
  • Frontend for model management, model selection for chat and playground UI
  • Embedding (Bert) model support

0.4.0

Features

  • bitsandbytes HF 4-bit backend
  • Takeoff PRO added to Iris
  • Multi GPU support
  • Mistral support
  • API docs for Takeoff
  • Redis and Python reader are spun up from the Rust gateway
  • Rust server
  • Rust server serves static files
  • AWQ Backend
  • Batched streaming for AWQ; Python reader integrates with the Rust gateway
  • Integration and benchmark tests for Takeoff
  • Regex guided generation
  • Unify logging formats between Rust & Python, rationalise log levels
  • Change batching behaviour to fix throughput issues
  • Manager for Redis connections in the Rust server
  • Conversion entrypoint for AWQ, CT2
  • Model management API PUT /models to spawn a new reader with a new config (see the sketch after this list)
  • React + TypeScript frontend
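
To show how the model management endpoint might be driven (a minimal sketch; the host, port, and config field names are assumptions rather than the documented schema):

```python
import requests

# Hypothetical reader config: these field names are illustrative only.
new_reader = {"model_name": "my-org/my-model", "device": "cuda"}

# PUT /models spawns a new reader with the supplied config. The host and
# port are assumptions for a local deployment.
resp = requests.put("http://localhost:3000/models", json=new_reader)
resp.raise_for_status()
print(resp.json())
```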