OpenAI Compatibility API
Takeoff has an integrated API layer to provide compatibility with OpenAI's Chat Completion API. This means that developers can now use OpenAI's existing client libraries or minimally adapt existing codebases to interact seamlessly with Takeoff via the API layer.
A full API schema for the compatibility layer is provided here.
Launch Takeoff with the OpenAI compatibility layer
The default port for this OpenAI-compatibility layer is 3003. As with the other ports used by Takeoff, this port should be forwarded (using -p) to allow interaction via the OpenAI-compatibility layer, as seen in the example below:
docker run --gpus all \
-e TAKEOFF_MODEL_NAME=gpt-3.5-turbo \
-e TAKEOFF_DEVICE=cuda \
-e LICENSE_KEY=[your_license_key] \
-e TAKEOFF_MAX_SEQUENCE_LENGTH=1024 \
-p 3000:3000 \
-p 3003:3003 \
-v ~/.takeoff_cache:/code/models \
tytn/takeoff-pro:0.19.1-gpu
By using this setup, you can directly pass queries in the OpenAI query schema to Takeoff, which will process these requests and return results using the OpenAI response schema. This feature is designed to offer developers an easy and effective way to leverage Takeoff's capabilities while maintaining compatibility with existing OpenAI-based workflows.
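If host port 3003 is already in use, the compatibility layer can be published on a different host port (for example -p 8080:3003) and the client simply pointed at that port. The host port 8080 below is only an illustrative choice:
from openai import OpenAI
# Assumes the container was started with `-p 8080:3003`, so the OpenAI-compatible
# endpoint is reachable on host port 8080 instead of the default 3003.
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not needed",  # Takeoff does not require a real API key; a placeholder is fine, as in the examples below
)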
Interfacing with Takeoff using the OpenAI schema
For Non-Streaming Responses
Using cURL:
curl --location 'http://localhost:3003/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "primary",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is inference server?"
}
],
"stream": false
}'
Using the OpenAI Python Client:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:3003/v1",
api_key="not needed"
)
chat_completion = client.chat.completions.create(
model="primary", # should be consumer group in takeoff
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is deep learning?"},
],
stream=False,
)
print(chat_completion)
"""
ChatCompletion(
id='cmpl-34324d2e-a604-4e17-b9cf-57929814a8bb',
choices=[
Choice(finish_reason='length', index=0, logprobs='null', message=None, text="\nassistant: Deep learning is a branch of machine learning that involves the use of deep neural networks.")
],
created=1707301232,
model='primary',
object='text_completion',
system_fingerprint=None,
usage='unknown')
"""
For Streaming Responses
Using cURL:
curl --location 'http://localhost:3003/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "primary",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is inference server?"
}
],
"stream": true
}'
Using the OpenAI Python Client:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:3003/v1",
api_key="not needed"
)
chat_completion = client.chat.completions.create(
model="primary", # should be consumer group in takeoff
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is deep learning?"},
],
stream=True,
)
for chunk in chat_completion:
print(chunk)
"""
ChatCompletionChunk(id='cmpl-71d2bd78-91a0-442b-a270-577cf819c02f',
choices=[
Choice(delta=ChoiceDelta(content='', # <--- streaming data
function_call=None,
role='assistant',
tool_calls=None),
finish_reason='length', index=0, logprobs='null')
],
created=1707302777,
model='primary',
object='chat.completion.chunk',
system_fingerprint='fp_44709d6fcb')
"""
Important Details on Supported Parameters
The OpenAI-compatibility API accommodates a specific set of parameters:
- model: Identifies the consumer group within Takeoff that you wish to use. By default it is set to 'primary', directing requests to the main processing group.
- messages: An array containing the sequence of messages that represents the conversation history. It is crucial for contextual continuity in interactions.
- stream: If set to true, the response is returned as a stream of server-sent events. Defaults to 'false'.
- temperature: The sampling temperature to use, between 0 and 2. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic.
- top_p: An alternative to sampling with temperature, called nucleus sampling, in which the model considers only the tokens comprising the top_p probability mass.
Other parameters (such as function_call or functions) are not supported and will be disregarded if passed to Takeoff.
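The supported sampling parameters can be passed straight through the standard client call. This is a minimal sketch assuming the same client and 'primary' consumer group as in the earlier examples:
chat_completion = client.chat.completions.create(
    model="primary",  # Takeoff consumer group
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    temperature=0.2,  # lower values give more focused, deterministic output
    top_p=0.9,        # nucleus sampling: only the top 90% probability mass is considered
    stream=False,
)
print(chat_completion)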