CHANGELOG
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.17.0
- Added a detokenization endpoint: use takeoff to turn tokens into text.
- Enhanced gemma 2 support.
- Chunked prefilling is now enabled by default.
- Various internal optimizations: you should see increased throughput across takeoff.
- Decreased memory usage for prefix caching.
- Fix chat templates for distributed takeoff setups.
- Fix for a bug that could reduce performance for long context Llama 3.1.
- Fix some overly verbose logging.
0.16.0
- Speed improvements: takeoff throughput should be more than 3x higher at high load.
- (Beta) prefix caching: takeoff throughput & latency will be substantially better for repeated prefixes.
- (Beta) KV cache quantization: fit larger models onto the same machine, or use larger batch sizes.
- UI/UX improvements: specify TAKEOFF_PAGE_CACHE_SIZE=X% as a fraction of your GPU's available memory, and takeoff will use that much.
- Chat template endpoint.
- Support for Llama 3.1.
- Support for schemaless JSON generation: pass json_schema: {} and the model will generate a valid JSON object, conforming to no particular schema.
- New JSON schema features: oneOf and anyOf are now supported.
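As a rough illustration of the JSON options above, the sketch below builds request bodies in Python. Only the json_schema key, the empty schema {}, and the oneOf / anyOf keywords come from this changelog. The "text" field, the helper name, and the example schemas are assumptions made for illustration, so check the API reference for the actual request shape.

```python
import json
from typing import Any, Dict


def build_payload(prompt: str, json_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Build a request body. The "text" key is an assumed field name."""
    return {"text": prompt, "json_schema": json_schema}


# Schemaless generation: an empty schema places no constraints on the output.
schemaless = build_payload("Describe a city as JSON.", {})

# anyOf: the value must match at least one of the listed subschemas.
any_of_schema = {
    "type": "object",
    "properties": {
        "population": {"anyOf": [{"type": "integer"}, {"type": "null"}]},
    },
    "required": ["population"],
}

# oneOf: the value must match exactly one of the listed subschemas.
one_of_schema = {
    "oneOf": [
        {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
        {
            "type": "object",
            "properties": {"country": {"type": "string"}},
            "required": ["country"],
        },
    ]
}

payloads = [
    schemaless,
    build_payload("Give the population of Paris.", any_of_schema),
    build_payload("Name a city or a country.", one_of_schema),
]

for payload in payloads:
    print(json.dumps(payload))
```

Under oneOf, an object carrying both "city" and "country" would match both subschemas and so be rejected, which is the practical difference from anyOf.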
0.15.2
- Bugfixes and performance improvements
0.15.1
- Bugfixes and performance improvements
0.15.0
- Distributed takeoff: distribute a set of takeoff containers over multiple machines
0.14.4
- Snowflake Integration with Takeoff! See our docs for more information.
- New AWQ kernels with improved performance.
- Internal throughput optimisations.
0.14.3
- Internal bugfixes and optimisations relating to: docker permissions when volume mounting model cache, better python GIL management, and token caching.
0.14.2
- Support Question Answering models in Takeoff - see this section in the classification docs for examples of how to use QA models.
0.14.1
- Support for Llama 3
0.14.0
- Fully enabled SSD (Spacelike Speculative Decoding) for static models
- Tokenization endpoint to get tokenized text for any live reader
- Support for Llava 1.6 models
- Introduce new AWQ kernel with significantly lower memory overhead.
- Updated LangChain integration: unified TitanTakeoff and TitanTakeoffPro, integrations now use the management API to spin up models, and added text embedding support with TitanTakeoffEmbed.
0.13.2
- Fixed issue with multi-gpu inference with models that have a bias in their attention linear layers.
0.13.1
- Fixed the configuration issue with the entrypoint for Mistral embedding models.
- Fixed the issue with continuous batching that was causing performance degradation.
- Added tokenization endpoint in takeoff.
0.13.0
- Support for inline images in image to text models. You can now supply an image to the image_generate (and image_generate_stream) endpoint in the form: <image:https://url.com/image.jpg>.
- Debug script for diagnosing issues with takeoff deployments.
- Support for Jina's long context embedding models.
- Support for Mistral based embedding models
- Support for API based (OpenAI) model calls from takeoff.
- Changes to default memory usage parameters to reduce the likelihood of OOM errors.
- Fix a bug where model downloading was not properly atomic. This means that a failed model download will no longer cause issues for subsequent launches.
- Fix a bug where the CPU container was larger than it should have been
- Assorted performance improvements and bugfixes
- Remove the ability to manually specify the backend that's used by takeoff.
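The inline image form quoted in this release can be produced with a small helper. This is a minimal sketch: only the <image:...> form and the example URL come from the notes above. The helper names, the sample question, and the choice to put the tag before the question are assumptions.

```python
def image_tag(url: str) -> str:
    """Wrap an image URL in the <image:...> form quoted in the 0.13.0 notes."""
    if not url or ">" in url:
        raise ValueError("URL must be non-empty and must not contain '>'")
    return f"<image:{url}>"


def build_image_prompt(url: str, question: str) -> str:
    # Placing the tag before the question is an assumption, not documented behaviour.
    return f"{image_tag(url)} {question}"


prompt = build_image_prompt("https://url.com/image.jpg", "What is in this picture?")
print(prompt)
```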
0.12.0
- Added OpenAI compatible interface layer
- Spacelike Speculative Decoding enabled for non-static models. Uses in memory cache for higher generation performance.
- Support for LLava image to text models.
- Support for Google's gemma model series
0.11.1
- Fixed a synchronization bug that could cause a timeout when leaving the server inactive for long periods of time.
0.11.0
- Added support for reranking & classification models.
- Added CUDA graph LRU caching to cap memory overheads when using CUDA graphs.
- Reduce size of GPU image by over half
- Fix bug where vertex integration couldn't find CUDA driver.
- Fix bug where synchronization issues could arise when using multi-gpu
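For readers unfamiliar with the idea behind the CUDA graph LRU caching entry above, here is a generic least-recently-used cache in Python. It is not Takeoff's implementation: the class name, the (batch size, sequence length) keys, and the string stand-ins for captured graphs are all hypothetical. It only shows how a fixed capacity caps how many expensive objects stay alive at once.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, List


class LRUCache:
    """Keep at most `capacity` items, evicting the least recently used one."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get_or_create(self, key: Hashable, build: Callable[[], Any]) -> Any:
        if key in self._items:
            # Cache hit: mark the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        # Cache miss: build the item, then evict the oldest entry if over capacity.
        value = build()
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)
        return value

    def keys(self) -> List[Hashable]:
        """Return keys ordered from least to most recently used."""
        return list(self._items)


cache = LRUCache(capacity=2)
cache.get_or_create((1, 128), lambda: "graph for batch 1")
cache.get_or_create((2, 128), lambda: "graph for batch 2")
cache.get_or_create((1, 128), lambda: "never built, already cached")
cache.get_or_create((4, 128), lambda: "graph for batch 4")
print(cache.keys())
```

The second lookup of (1, 128) refreshes that entry, so adding a third key evicts (2, 128) rather than (1, 128).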
0.10.0
- Introduced a new custom takeoff inference engine, which standardizes backend processes and offers an enhanced interface for generation models.
- In light of the unified backend, continuous batching now works for all generation models.
- Implemented GPU/CPU utilization tracking metrics.
- Released takeoff_client, a Python client package on PyPI for server interaction.
- Removed the option to select backends from the management frontend.
- Overhauled all documentation. Added an API References section.
- Added support for Mixtral
0.9.2
- Bugfix to ensure that GPU VRAM is always cleaned up after a model is dynamically deleted.
0.9.1
- Added /config/:reader_id endpoint to the Takeoff Management API to get the config.json file of the model that the reader is currently running.
0.9.0
- Ability to configure Takeoff with a "config.yaml" file, which should be mounted at code/config.yaml inside the Takeoff container. This lets you specify multiple readers and server config declaratively when starting the container. You can still use environment variables to override individual settings.
- compress-fast backend now supports splitting across multiple GPUs.
0.8.0
- Add continuous batching for the baseline, fast, and compress-fast backends
- Add licence validation for takeoff
- Added loading readers to management frontend
- Add the ability to cancel requests
- Minor bug fix to speculative decoding
- Minor bug fix to multigpu backend
0.7.1
- Ready flag added to the management API GET /reader_groups endpoint to show whether a model has finished loading.
- Redis max memory and takeoff single prompt limit are now configurable in environment variables: TAKEOFF_REDIS_MAX_MEMORY and TAKEOFF_MAX_PROMPT_STRING_BYTES. Their defaults are set to 1GB and 30KB respectively.
- Removed the ability to send generation requests to embedding models through the frontend UIs.
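Because the prompt limit is expressed in bytes rather than characters, a client-side check can be useful before sending long prompts. The sketch below is an illustration under assumptions: it reads "30KB" as 30 * 1024 bytes and measures prompts as UTF-8, neither of which is confirmed by this changelog, and the function names are hypothetical.

```python
# Assumption: "30KB" is read here as 30 * 1024 bytes.
DEFAULT_MAX_PROMPT_BYTES = 30 * 1024


def prompt_size_bytes(prompt: str) -> int:
    """Size of the prompt in bytes, assuming UTF-8 encoding."""
    return len(prompt.encode("utf-8"))


def fits_prompt_limit(prompt: str, max_bytes: int = DEFAULT_MAX_PROMPT_BYTES) -> bool:
    """Return True if the prompt is within the byte limit."""
    return prompt_size_bytes(prompt) <= max_bytes


short_prompt = "Summarise this paragraph."
long_prompt = "a" * 40_000

print(prompt_size_bytes(short_prompt), fits_prompt_limit(short_prompt))
print(prompt_size_bytes(long_prompt), fits_prompt_limit(long_prompt))
```

Non-ASCII text takes more than one byte per character in UTF-8, so a prompt can exceed the byte limit while its character count is still below it.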
0.7.0
- New model memory calculator in the inference frontend! You can check whether your models will fit on your hardware with your desired sequence length and batch size.
- Hash history for inference and management apps to fix getting a 404 when refreshing a sub-page of the app.
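The kind of estimate a memory calculator makes can be approximated by hand. The sketch below is a common back-of-the-envelope formula, not the calculator's actual method: it counts only model weights and the key/value cache, assumes standard multi-head attention with 16-bit values, and ignores activations and framework overhead. The function name and the example model dimensions are assumptions.

```python
from typing import Dict


def estimate_memory_gib(
    n_params: int,
    n_layers: int,
    hidden_size: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # 2 bytes per value, i.e. fp16 or bf16
) -> Dict[str, float]:
    """Rough memory estimate: weights plus KV cache, in GiB."""
    gib = 1024 ** 3
    weights = n_params * bytes_per_value
    # One key and one value vector per layer, per token, per sequence.
    kv_cache = 2 * n_layers * hidden_size * seq_len * batch_size * bytes_per_value
    return {
        "weights_gib": weights / gib,
        "kv_cache_gib": kv_cache / gib,
        "total_gib": (weights + kv_cache) / gib,
    }


# Example: a hypothetical 7B-parameter model, 32 layers, hidden size 4096.
estimate = estimate_memory_gib(
    n_params=7_000_000_000,
    n_layers=32,
    hidden_size=4096,
    seq_len=4096,
    batch_size=1,
)
fits_on_24_gib = estimate["total_gib"] <= 24
print(estimate, fits_on_24_gib)
```

Doubling either the batch size or the sequence length doubles the KV cache term while leaving the weights term unchanged, which is why both settings matter when checking whether a model fits.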
0.6.3
- Inference and Management frontend applications can now be served under paths, e.g. https://takeoff.example.com/inference or https://takeoff.example.com/playground. This is useful for serving frontends when deploying on Kubernetes and using an ingress to route traffic to your takeoff pod.
- Sagemaker and Vertex AI compatible inference APIs are served on 8080 and 3002 respectively and now have API documentation under /docs.
- Minor bug fix to Playground UI where no output was displayed.
- Minor bug fixes to takeoff loading process to communicate more verbosely with the API frontend. This ensures /healthz is more robust and adds knowledge of loading readers to the API.
0.6.1
- Small adjustment to turn down default log verbosity for Takeoff users.
0.6.0
This release adds support for speculative decoding.
Now a small draft model can be used to decrease model latency by drafting a response before the large model verifies it.
This can increase speed 2x without affecting model outputs.
This is applied by default whenever a valid student model is available, or can be controlled with the TAKEOFF_ASSISTANT_NAME environment variable.
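To show why drafting and verifying leaves outputs unchanged, here is a toy sketch of the greedy variant of speculative decoding. It is not Takeoff's implementation: the two "models" are hypothetical functions over integer tokens, and verification is written as a loop where a real engine would score all drafted positions in one batched forward pass of the large model.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The small draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks each proposed position in order.
        accepted: List[int] = []
        for token in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)  # always keep the target's own choice
            if expected != token:
                break  # first disagreement: discard the rest of the draft
        seq.extend(accepted)
    return seq


def target_model(seq: List[int]) -> int:
    # Hypothetical "large" model: a fixed rule over the last token.
    return (seq[-1] * 3 + 1) % 7


def draft_model(seq: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except after token 5.
    return 0 if seq[-1] == 5 else target_model(seq)


baseline = greedy_decode(target_model, [1], 12)
speculative = speculative_decode(target_model, draft_model, [1], 12, k=4)
print(baseline == speculative)
```

Every token that is kept is the target model's own greedy choice, so the result matches plain greedy decoding with the target. The speed-up comes from accepting several drafted tokens per verification pass when the draft model agrees.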
The front end has two new features:
- A metrics page which shows the statistics of the responses of each model
- JSON Schema support to use the controlled generation techniques introduced in 0.5.0
Features
- Add speculative decoding
- Add metrics dashboard
- Expand JSON schema support to the front-end
Fixes
0.5.0
Features
This release was focused on tools to integrate RAG functionality within Takeoff. We add support for embedding models with the BERT architecture. This gives an easy way to embed thousands of documents quickly. A single GPU can host a BERT model alongside one or more generative models, meaning multiple applications can be powered by a single GPU.
We also introduce controlled generation to the API. You can specify a regex string or a JSON schema in the API, which will guarantee that the output matches the schema or regex.
- Add structured generation: JSON + regex outputs
- Support multiple readers dynamically
- Add "prompt_max_tokens" generation parameter across backends, for truncating prompts to max number of tokens
- Frontend for model management, model selection for chat and playground UI
- Embedding (BERT) model support
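To make the controlled generation guarantee concrete: with a regex constraint, every response should fully match the pattern, and with a JSON schema, every response should parse as JSON of the required shape. The sketch below checks those two properties on example strings from the client side. It does not call Takeoff, and the helper names, the date pattern, and the sample outputs are made up for illustration.

```python
import json
import re
from typing import Iterable


def matches_regex(output: str, pattern: str) -> bool:
    """True if the whole output matches the pattern, not just a prefix."""
    return re.fullmatch(pattern, output) is not None


def is_json_object_with_keys(output: str, required_keys: Iterable[str]) -> bool:
    """True if the output parses as a JSON object containing every required key."""
    try:
        value = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(value, dict) and all(key in value for key in required_keys)


date_pattern = r"\d{4}-\d{2}-\d{2}"

print(matches_regex("2024-01-31", date_pattern))
print(matches_regex("Jan 31, 2024", date_pattern))
print(is_json_object_with_keys('{"name": "Ada", "age": 36}', ["name", "age"]))
print(is_json_object_with_keys("name: Ada", ["name"]))
```

Without constrained generation, checks like these have to run after the fact and failed outputs retried. With it, the constraint is enforced while the model generates.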
0.4.3
Features
Fixes
- AWQ backend accepts safetensors as the model format in repo
0.4.2
Features
Fixes
- OOM fixed for other backends
0.4.1
Features
Fixes
- OOM fixed for HF and BNB backend
0.4.0
Features
- Bits and bytes HF 4 bit backend
- Takeoff PRO added to Iris
- Multi GPU support
- Mistral support
- API docs for takeoff
- Redis and Python reader are spun up from rust gateway
- Rust server
- Rust server serves static files
- AWQ Backend
- Batched streaming for AWQ, python reader integrates with Rust gateway
- Integration and benchmark tests for takeoff
- Regex guided generation
- Unify logging formats between rust & python, rationalise log levels
- Change batching behaviour to fix throughput issues
- Manager for redis connections in the rust server
- Conversion entrypoint for AWQ, CT2.
- Model management API PUT /models to spawn new reader with new config
- Added bitsandbytes 4bit backend
- React + Typescript Frontend