Version: 0.20.x

OpenAI Compatibility API

Takeoff includes an integrated API layer that is compatible with OpenAI's Chat Completion API. Developers can use OpenAI's existing client libraries, or minimally adapt existing codebases, to interact with Takeoff through this layer.

A full API schema for the compatibility layer is provided here.

Launch Takeoff with OpenAI compatibility layer

The OpenAI compatibility layer listens on port 3003 by default. As with the other ports used by Takeoff, forward this port (using -p) so that clients can reach the compatibility layer, as in the example below:

Launching with OpenAI compatibility
docker run --gpus all \
    -e TAKEOFF_MODEL_NAME=gpt-3.5-turbo \
    -e TAKEOFF_DEVICE=cuda \
    -e LICENSE_KEY=[your_license_key] \
    -e TAKEOFF_MAX_SEQUENCE_LENGTH=1024 \
    -p 3000:3000 \
    -p 3003:3003 \
    -v ~/.takeoff_cache:/code/models \
    tytn/takeoff-pro:0.20.0-gpu

With this setup, you can send requests in the OpenAI request schema directly to Takeoff, which processes them and returns results in the OpenAI response schema. This lets existing OpenAI-based workflows use Takeoff with minimal changes.
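Before sending requests, it can help to confirm that the forwarded port is reachable. The following is a minimal sketch, assuming the container runs on the local machine with `-p 3003:3003`; the helper name `is_port_open` is hypothetical and not part of Takeoff. A successful TCP connection only shows that something is listening on the port, not that the model has finished loading.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# 3003 is the default compatibility-layer port; "localhost" is an assumption
# that holds only when the container runs on this machine.
print(is_port_open("localhost", 3003))
```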

Interfacing with Takeoff in OpenAI schema

For non-Streaming Response

Using cURL

curl --location 'http://localhost:3003/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model": "primary",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is an inference server?"
    }
  ],
  "stream": false
}'

Using OpenAI Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3003/v1",
    api_key="not needed"
)

chat_completion = client.chat.completions.create(
    model="primary",  # name of the Takeoff consumer group
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    stream=False,
)
print(chat_completion)

"""
ChatCompletion(
    id='cmpl-34324d2e-a604-4e17-b9cf-57929814a8bb',
    choices=[
        Choice(finish_reason='length', index=0, logprobs='null', message=None, text="\nassistant: Deep learning is a branch of machine learning that involves the use of deep neural networks.")
    ],
    created=1707301232,
    model='primary',
    object='text_completion',
    system_fingerprint=None,
    usage='unknown')
"""
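If the `openai` package is not available, the same request can be assembled with the Python standard library. This is a sketch rather than an official client: the helper `build_chat_request` is a hypothetical name, and the URL assumes the layer is forwarded to `localhost:3003` as in the examples above. The request is only built here; the commented lines show how it could be sent to a running instance.

```python
import json
import urllib.request

# Assumed address: compatibility layer forwarded to localhost:3003.
TAKEOFF_URL = "http://localhost:3003/v1/chat/completions"


def build_chat_request(messages, model="primary", stream=False, url=TAKEOFF_URL):
    """Build (but do not send) a POST request in the OpenAI chat schema."""
    payload = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ]
)

# To send it against a running Takeoff instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```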

For Streaming Response

Using cURL

curl --location 'http://localhost:3003/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
  "model": "primary",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is an inference server?"
    }
  ],
  "stream": true
}'

Using OpenAI Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3003/v1",
    api_key="not needed"
)

chat_completion = client.chat.completions.create(
    model="primary",  # name of the Takeoff consumer group
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning?"},
    ],
    stream=True,
)
for chunk in chat_completion:
    print(chunk)
"""
ChatCompletionChunk(id='cmpl-71d2bd78-91a0-442b-a270-577cf819c02f',
    choices=[
        Choice(delta=ChoiceDelta(content='',  # <--- streaming data
            function_call=None,
            role='assistant',
            tool_calls=None),
        finish_reason='length', index=0, logprobs='null')
    ],
    created=1707302777,
    model='primary',
    object='chat.completion.chunk',
    system_fingerprint='fp_44709d6fcb')
"""
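Each streamed chunk carries only a fragment of the reply in `choices[0].delta.content`, so a client usually joins the fragments to recover the full text. The sketch below shows one way to do that. The helper `collect_stream_text` and the `fake_chunk` stand-in are hypothetical names used for illustration; the stand-in only mimics the attribute layout of `ChatCompletionChunk` so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream_text(chunks) -> str:
    """Join the delta.content fragments of the first choice in each chunk."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # nothing to read in this chunk
        content = chunk.choices[0].delta.content
        if content:  # skip None or empty fragments
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Illustrative stand-in shaped like a ChatCompletionChunk."""
    delta = SimpleNamespace(content=text, role="assistant")
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta, index=0)])


demo = [fake_chunk("Deep "), fake_chunk("learning"), fake_chunk(None), fake_chunk("")]
print(collect_stream_text(demo))  # prints "Deep learning"
```

With a live stream, the iterator returned by `client.chat.completions.create(..., stream=True)` can be passed in place of `demo`.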

Parameter mapping

Important Details on Supported Parameters

The OpenAI-compatibility API supports the following parameters:

model: This parameter identifies the consumer group within Takeoff that you wish to use. By default, it is set to 'primary', directing requests to the main processing group.
messages: This is an array containing the sequence of messages that represent the conversation history. It's crucial for contextual continuity in interactions.
stream: If set to true, the response is returned as a stream of server-sent events. Defaults to false.
temperature: What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.

Other parameters (such as function_call or functions) are not supported and will be disregarded if passed to Takeoff.
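Because unsupported parameters are silently disregarded, it can be useful to check a request on the client side before sending it. The following is a minimal sketch, assuming the five parameters listed above are the complete supported set; `SUPPORTED_PARAMS` and `split_supported` are hypothetical names, not part of Takeoff or the OpenAI client.

```python
from typing import Any, Dict, Tuple

# Parameters documented as supported by the compatibility layer.
SUPPORTED_PARAMS = {"model", "messages", "stream", "temperature", "top_p"}


def split_supported(params: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split request parameters into (supported, ignored) dictionaries."""
    supported = {k: v for k, v in params.items() if k in SUPPORTED_PARAMS}
    ignored = {k: v for k, v in params.items() if k not in SUPPORTED_PARAMS}
    return supported, ignored


request = {
    "model": "primary",
    "messages": [{"role": "user", "content": "What is deep learning?"}],
    "temperature": 0.2,
    "top_p": 0.9,
    "functions": [],  # not supported: Takeoff would disregard this
}
supported, ignored = split_supported(request)
print(sorted(supported))  # ['messages', 'model', 'temperature', 'top_p']
print(sorted(ignored))    # ['functions']
```

The `supported` dictionary can then be unpacked into `client.chat.completions.create(**supported)`, and `ignored` can be logged so that dropped options do not go unnoticed.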
