0.21.0

  • Prefix Coalescing for drastically improved throughput on workloads with varying degrees of shared prefixes, enabled by default.
  • Support for multiple images in image-to-text workloads.
  • Ability to dynamically scale LLM deployments based on usage metrics, including scale to zero.
  • Support for the new Transformers Tokenizer file format.
  • Improved robustness in the connection between the server and its readers.
  • Improved stability for image-to-text models to prevent out-of-memory errors.
  • Fixed a bug that caused idle multi-GPU instances to fail.
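To illustrate the idea behind usage-based scaling with scale to zero, here is a minimal sketch in Python. This is not the product's actual API; the function name, capacity figure, and thresholds are all hypothetical.

```python
import math

# Hypothetical autoscaling decision: pick a replica count from a usage
# metric, dropping to zero replicas when there is no traffic.
def desired_replicas(requests_per_min: float,
                     capacity_per_replica: float = 60.0,
                     max_replicas: int = 8) -> int:
    """Return a replica count proportional to load, scaling to zero when idle."""
    if requests_per_min <= 0:
        return 0  # scale to zero: no traffic means no running replicas
    # Round up so a fractional shortfall still gets a full replica.
    needed = math.ceil(requests_per_min / capacity_per_replica)
    return min(needed, max_replicas)
```

In practice a real autoscaler would also apply a cool-down window before scaling to zero, so brief lulls in traffic do not trigger a cold start on the next request.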