0.6.0
This release adds support for speculative decoding.
A small draft model can now be used to reduce latency by drafting a response that the large model then verifies.
This can speed up generation by up to 2x without changing model outputs.
Speculative decoding is applied by default whenever a valid draft model is available, and can be controlled with the TAKEOFF_ASSISTANT_NAME
environment variable.
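For a Docker-based deployment, selecting the draft model might look like the sketch below. Only TAKEOFF_ASSISTANT_NAME comes from this release; the image name, model names, and the TAKEOFF_MODEL_NAME variable are illustrative assumptions.

```shell
# Hypothetical launch command; image and model names are illustrative.
docker run -it \
  -e TAKEOFF_MODEL_NAME=<large-model-name> \
  -e TAKEOFF_ASSISTANT_NAME=<small-draft-model-name> \
  -p 3000:3000 \
  <takeoff-image>
```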
The front end has two new features:
- A metrics page that shows response statistics for each model
- JSON Schema support to use the controlled generation techniques introduced in 0.5.0
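As a sketch of how the expanded JSON Schema support might be exercised against the generation API: the endpoint path and the `json_schema` field name are assumptions, not confirmed by this changelog.

```shell
# Hypothetical request constraining the output to a JSON schema;
# the endpoint and request fields are assumptions.
curl http://localhost:3000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Extract the name and age from: John is 30 years old.",
    "json_schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"}
      },
      "required": ["name", "age"]
    }
  }'
```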
Features
- Add speculative decoding
- Add metrics dashboard
- Expand JSON schema support to the front-end