
0.6.0

This release adds support for speculative decoding. A small draft model can now be used to reduce latency by drafting a response that the large model then verifies. This can speed up generation by up to 2x without affecting model outputs. Speculative decoding is applied by default whenever a compatible draft model is available, and can be controlled with the TAKEOFF_ASSISTANT_NAME environment variable.
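As a rough illustration of the draft-and-verify idea, here is a minimal, self-contained Python sketch. The toy token functions stand in for real models, the draft model name is a placeholder, and only the TAKEOFF_ASSISTANT_NAME variable name comes from these notes; this is not Takeoff's internal implementation.

```python
import os

# Hypothetical example of selecting a draft (assistant) model via the
# environment variable mentioned above; the model name is a placeholder.
os.environ["TAKEOFF_ASSISTANT_NAME"] = "my-small-draft-model"

def make_model(vocab):
    """Toy 'model': deterministically picks the next token from a fixed cycle."""
    def next_token(context):
        return vocab[len(context) % len(vocab)]
    return next_token

target = make_model(["a", "b", "c", "d"])          # expensive "large" model
draft_step = make_model(["a", "b", "x", "d"])      # cheap draft model, mostly agrees

def draft(context, n):
    """Let the draft model propose n tokens in a row."""
    out, ctx = [], list(context)
    for _ in range(n):
        tok = draft_step(ctx)
        out.append(tok)
        ctx.append(tok)
    return out

def speculative_decode(prompt, num_draft_tokens=4, max_tokens=8):
    """Draft-and-verify loop: the draft model proposes a few tokens,
    the target model checks them, and only agreed tokens are kept."""
    output = list(prompt)
    while len(output) - len(prompt) < max_tokens:
        proposals = draft(output, num_draft_tokens)
        accepted = []
        for tok in proposals:
            expected = target(output + accepted)
            if tok == expected:
                accepted.append(tok)       # proposal matches: keep it for free
            else:
                accepted.append(expected)  # mismatch: fall back to the target's token
                break
        output.extend(accepted)
    return output

print(speculative_decode(["a"]))
```

Because accepted tokens always match what the target model would have produced, the output is unchanged; the speedup comes from verifying several drafted tokens per expensive target-model step.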

The front end has two new features:

  1. A metrics page showing response statistics for each model
  2. JSON Schema support, exposing the controlled generation techniques introduced in 0.5.0 (see the sketch after this list)
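For a sense of what JSON Schema-constrained generation looks like, here is a hedged Python sketch. The schema itself is standard JSON Schema, but the payload shape, parameter name, and prompt are hypothetical placeholders rather than Takeoff's documented API.

```python
import json

# A JSON Schema describing the structure the model's output must follow.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["name", "age"],
}

# Hypothetical request payload: the "json_schema" field name is illustrative only.
payload = {
    "text": "Extract the person described in the passage.",
    "json_schema": schema,
}
print(json.dumps(payload, indent=2))
```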

Features

  • Add speculative decoding
  • Add metrics dashboard
  • Expand JSON Schema support to the front end