0.6.0
This release adds support for speculative decoding.
A small draft model can now be used to reduce latency by drafting a response that the large model then verifies.
This can speed up generation by up to 2x without changing model outputs.
Speculative decoding is applied by default whenever a valid draft model is available, and can be controlled with the TAKEOFF_ASSISTANT_NAME
environment variable.
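For a Docker-based deployment, selecting the draft model might look like the sketch below. Only TAKEOFF_ASSISTANT_NAME comes from this release; the image name, model names, and the TAKEOFF_MODEL_NAME variable are illustrative assumptions.

```shell
# Hypothetical launch command; image and model names are illustrative.
docker run -it \
  -e TAKEOFF_MODEL_NAME=<large-model-name> \
  -e TAKEOFF_ASSISTANT_NAME=<small-draft-model-name> \
  -p 3000:3000 \
  <takeoff-image>
```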
The front end has two new features:
- A metrics page that shows response statistics for each model
- JSON Schema support to use the controlled generation techniques introduced in 0.5.0
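As a sketch of how the expanded JSON Schema support might be exercised against the generation API: the endpoint path and the `json_schema` field name are assumptions, not confirmed by this changelog.

```shell
# Hypothetical request constraining the output to a JSON schema;
# the endpoint and request fields are assumptions.
curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Extract the name and age from: John is 30 years old.",
    "json_schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"}
      },
      "required": ["name", "age"]
    }
  }'
```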
Features
- Add speculative decoding
- Add metrics dashboard
- Expand JSON schema support to the front-end