Takeoff 0.21.2 is released 🎉 Speak with us to find out more: hello@titanml.co
- Speed improvements: takeoff throughput should be >3x faster at high load.
- (Beta) prefix caching: takeoff throughput & latency will be substantially better for repeated prefixes.
- (Beta) KV cache quantization: fit larger models onto the same machine, or use larger batch sizes.
- UI/UX improvements: specify `TAKEOFF_PAGE_CACHE_SIZE=X%` as a fraction of your GPU's available memory, and takeoff will use that much.
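For example, to let takeoff use 80% of GPU memory for its page cache, the variable can be set before launch. A minimal sketch — the launch command itself is omitted, and `80%` is an illustrative value:

```shell
# Set the page cache size as a percentage of available GPU memory,
# then launch takeoff as usual (launch command omitted here).
export TAKEOFF_PAGE_CACHE_SIZE=80%
echo "$TAKEOFF_PAGE_CACHE_SIZE"
```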
- Chat template endpoint.
- Support for Llama 3.1.
- Support for schemaless JSON generation: pass `json_schema: {}` and the model will generate a valid JSON object, conforming to no particular schema.
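A schemaless request body might look like the sketch below. The field names besides `json_schema` are assumptions for illustration, not takeoff's exact request format:

```python
import json

# Hypothetical generation request: an empty json_schema means any
# valid JSON object is acceptable, with no structural constraints.
payload = {
    "text": "Describe the weather as a JSON object.",  # illustrative prompt field
    "json_schema": {},  # empty schema -> schemaless JSON generation
}

# Serialize the payload as it would be sent over the wire.
body = json.dumps(payload)
print(body)
```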
- New JSON schema features: `oneOf` and `anyOf` now supported.
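As a sketch of what these keywords enable, the schema below constrains output to one of two alternative shapes. The field names are illustrative, not from takeoff's documentation:

```python
import json

# Hypothetical json_schema using anyOf: the model's output must match
# at least one of the listed sub-schemas -- either a plain string
# answer, or an object carrying an error message.
schema = {
    "anyOf": [
        {"type": "string"},
        {
            "type": "object",
            "properties": {"error": {"type": "string"}},
            "required": ["error"],
        },
    ]
}

print(json.dumps(schema, indent=2))
```

`oneOf` works similarly but requires the output to match exactly one of the sub-schemas rather than at least one.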