0.16.0

  • Speed improvements: takeoff throughput should be >3x faster at high load.
  • (Beta) prefix caching: takeoff throughput & latency will be substantially better for repeated prefixes.
  • (Beta) KV cache quantization: fit larger models onto the same machine, or use larger batch sizes.
  • UI/UX improvements: set TAKEOFF_PAGE_CACHE_SIZE=X% to give takeoff that percentage of your GPU's available memory for the page cache.
  • Chat template endpoint.
  • Support for Llama 3.1.
  • Support for schemaless JSON generation: pass json_schema: {} and the model will generate a valid JSON object conforming to no particular schema.
  • New JSON schema features: the oneOf and anyOf keywords are now supported.
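
As a minimal sketch of the percentage syntax: the variable name and =X% form come from the notes above; the 80% value is only an example, and how you launch takeoff (docker, binary, etc.) is unchanged.

```shell
# Illustrative only: reserve 80% of the GPU's available memory for
# takeoff's page cache. The variable name and X% syntax are from the
# release notes; 80% is an arbitrary example value.
export TAKEOFF_PAGE_CACHE_SIZE=80%
echo "page cache size set to ${TAKEOFF_PAGE_CACHE_SIZE}"
```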
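
To illustrate the JSON generation changes, the payloads below are sketches, not the documented API: the json_schema parameter name comes from the notes above, but the surrounding field names (e.g. text) and any endpoint details are assumptions to check against the API reference.

```python
import json

# Hypothetical request body for schemaless generation: an empty schema
# means any valid JSON object is acceptable (per the release notes).
# The "text" field name is an illustrative assumption.
schemaless = {
    "text": "Describe today's weather as JSON.",
    "json_schema": {},
}

# A schema exercising the newly supported anyOf keyword: generated
# output must match at least one of the listed sub-schemas.
constrained = {
    "text": "Give me a contact record.",
    "json_schema": {
        "anyOf": [
            {
                "type": "object",
                "properties": {"email": {"type": "string"}},
                "required": ["email"],
            },
            {
                "type": "object",
                "properties": {"phone": {"type": "string"}},
                "required": ["phone"],
            },
        ]
    },
}

# Serialize as you would when POSTing to the server.
body = json.dumps(constrained)
```

oneOf works the same way, but requires the output to match exactly one of the sub-schemas rather than at least one.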