0.16.0

  • Speed improvements: takeoff throughput should be >3x faster at high load.
  • (Beta) prefix caching: takeoff throughput & latency will be substantially better for repeated prefixes.
  • (Beta) KV cache quantization: fit larger models onto the same machine, or use larger batch sizes.
  • UI/UX improvements: set TAKEOFF_PAGE_CACHE_SIZE=X% to give takeoff that percentage of your GPU's available memory for the page cache.
  • Chat template endpoint.
  • Support for Llama 3.1.
  • Support for schemaless JSON generation: pass json_schema: {} and the model will generate a valid JSON object conforming to no particular schema.
  • New JSON schema features: the oneOf and anyOf keywords are now supported.
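
As a minimal sketch of the percentage syntax: the variable name and =X% form come from the notes above; the 80% value is only an example, and how you launch takeoff (docker, binary, etc.) is unchanged.

```shell
# Illustrative only: reserve 80% of the GPU's available memory for
# takeoff's page cache. The variable name and X% syntax are from the
# release notes; 80% is an arbitrary example value.
export TAKEOFF_PAGE_CACHE_SIZE=80%
echo "page cache size set to ${TAKEOFF_PAGE_CACHE_SIZE}"
```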
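
To illustrate the JSON generation changes, the payloads below are sketches, not the documented API: the json_schema parameter name comes from the notes above, but the surrounding field names (e.g. text) and any endpoint details are assumptions to check against the API reference.

```python
import json

# Hypothetical request body for schemaless generation: an empty schema
# means any valid JSON object is acceptable (per the release notes).
# The "text" field name is an illustrative assumption.
schemaless = {
    "text": "Describe today's weather as JSON.",
    "json_schema": {},
}

# A schema exercising the newly supported anyOf keyword: generated
# output must match at least one of the listed sub-schemas.
constrained = {
    "text": "Give me a contact record.",
    "json_schema": {
        "anyOf": [
            {
                "type": "object",
                "properties": {"email": {"type": "string"}},
                "required": ["email"],
            },
            {
                "type": "object",
                "properties": {"phone": {"type": "string"}},
                "required": ["phone"],
            },
        ]
    },
}

# Serialize as you would when POSTing to the server.
body = json.dumps(constrained)
```

oneOf works the same way, but requires the output to match exactly one of the sub-schemas rather than at least one.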