Version: 0.11.x

Generation endpoints

To run inference with Takeoff, POST a JSON payload containing text (the prompt) and any generation parameters to the REST API endpoint, which returns a JSON response containing the generated text.

There are two generation endpoints, /generate and /generate_stream. /generate returns the entire response at once, whilst /generate_stream returns the response as a stream of Server-sent events. Server-sent events are handled automatically by the Python client package, and are described in more detail here.
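For clients that do not use the Python client package, the event stream can be parsed by hand. The sketch below is a minimal parser for the standard Server-sent events wire format (`data:` lines, with events separated by a blank line). The function name `iter_sse_data` is illustrative, and nothing is assumed here about the exact payload Takeoff places inside each event.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event.

    `lines` is any iterable of decoded text lines, with or without
    trailing line terminators.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            # Lines starting with a colon are comments (often keep-alives).
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The format strips a single leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    if buffer:
        # Lenient choice: flush a final event that has no closing blank line.
        # The SSE specification would discard it instead.
        yield "\n".join(buffer)
```

With the requests library, this could be fed from `response.iter_lines(decode_unicode=True)` on a response opened with `stream=True`, after setting `response.encoding = 'utf-8'` so that the lines arrive as text rather than bytes.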

Both of these endpoints support batching via continuous batching or through submission of a user-compiled batch.

note

Takeoff also supports models for sequence embedding via the embed endpoint. See the docs for that endpoint here, or for classification & reranking models, see here.

Examples

Takeoff can be interfaced with via the REST API, the GUI, or through our Python client.

Python (Takeoff Client)
Python (requests)
Javascript
cURL

# Ensure the 'takeoff_client' package is installed
# To install it, use the command: `pip install takeoff_client`
from takeoff_client import TakeoffClient

client = TakeoffClient(base_url="http://localhost", port=3000)

generator = client.generate_stream(
    'List 3 things to do in London.',
    sampling_temperature=0.1,
    no_repeat_ngram_size=3,
)
for event in generator:
    print(event.data)

import requests

input_text = 'List 3 things to do in London.'
url = "http://localhost:3000/generate_stream"

# add the generation parameters to the json payload
json = {
    "text":input_text,
    "sampling_temperature":0.1,
    "no_repeat_ngram_size":3
    }

response = requests.post(url, json=json, stream=True)
response.encoding = 'utf-8'

for text in response.iter_content(chunk_size=1, decode_unicode=True):
    if text:
        print(text, end="", flush=True)

// import the axios library: to install, run `npm install axios`
// in browser, use <script src="https://unpkg.com/axios/dist/axios.min.js"></script>
// or add to your build process
import axios from "axios";

let input_text = "List 3 things to do in London.";
let url = "http://localhost:3000/generate_stream";
// add the generation parameters to the json payload
let json = {
  text: input_text,
  sampling_temperature: 0.1,
  no_repeat_ngram_size: 3,
};

axios({
  method: "post",
  url: url,
  data: json,
  responseType: "stream",
})
  .then(function (response) {
    response.data.on("data", (chunk) => {
      console.log(chunk.toString());
    });

    response.data.on("end", () => {
      console.log("Stream complete");
    });
  })
  .catch(function (error) {
    console.log(error);
  });

generation_parameters.sh

curl -X POST "http://localhost:3000/generate_stream" -H "accept: application/json" -H "Content-Type: application/json" -d "{\"text\":\"List 3 things to do in London.\",\"sampling_temperature\":0.1,\"no_repeat_ngram_size\":3}"

Generation Request Parameters

Generation parameters

Takeoff lets you shape your model's output via the following standard generation parameters.

Parameter Name	Description	Default Value
sampling_topk	Sample predictions from the top K most probable candidates	1
sampling_topp	Sample from the smallest set of most probable tokens whose cumulative probability exceeds this value	1.0 (no restriction)
sampling_temperature	Sample with randomness. Higher temperatures are associated with more randomness and 'creativity'.	1.0
repetition_penalty	Penalise the generation of tokens that have been generated before. Set to > 1 to penalise.	1 (no penalty)
no_repeat_ngram_size	Prevent repetitions of n-grams of this size.	0 (turned off)
max_new_tokens	The maximum number of tokens that the model will generate in response to a prompt.	128
min_new_tokens	The minimum number of tokens that the model will generate in response to a prompt.	1
prompt_max_tokens	The maximum length (in tokens) for this prompt. Prompts longer than this value will be truncated.	None (truncation only to model context length)
regex_str	The regex string which generations will adhere to as they decode.	None
json_schema	The JSON Schema which generations will adhere to as they decode. Ignored if `regex_str` is set.	None
consumer_group	The consumer group to which to send the request.	'primary'
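A misspelled parameter name is easy to miss, so it can help to check names on the client before sending a request. The helper below is a hypothetical sketch (`build_payload` is not part of the Takeoff client); the parameter names are copied from the table above, and how the server treats unknown fields is not assumed.

```python
# Parameter names copied from the table above; the helper itself is hypothetical.
GENERATION_PARAMETERS = {
    "sampling_topk", "sampling_topp", "sampling_temperature",
    "repetition_penalty", "no_repeat_ngram_size", "max_new_tokens",
    "min_new_tokens", "prompt_max_tokens", "regex_str", "json_schema",
    "consumer_group",
}


def build_payload(text, **parameters):
    """Return a JSON-serialisable request body for /generate or /generate_stream."""
    unknown = set(parameters) - GENERATION_PARAMETERS
    if unknown:
        raise ValueError(f"Unknown generation parameters: {sorted(unknown)}")
    payload = {"text": text}
    # Omit parameters left as None so that the server-side defaults apply.
    payload.update({k: v for k, v in parameters.items() if v is not None})
    return payload
```

For example, `build_payload('List 3 things to do in London.', sampling_temperature=0.1, no_repeat_ngram_size=3)` produces the same body as the request examples earlier on this page, and the result can be passed as the `json=` argument of `requests.post`.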

The sampling_topk, sampling_topp and sampling_temperature parameters are explained in detail here. Learn more about picking generation parameters for your task here.
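To make the three sampling parameters concrete, here is a small pure-Python sketch of their textbook definitions applied to a toy set of logits. It is an illustration only: the order in which Takeoff applies the filters is not confirmed here, the function and argument names are invented for the example, and `top_k=0` meaning "no top-k filter" is a convention of this sketch rather than of the API.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales logits: values below 1 sharpen, values above 1 flatten.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    # Subtract the peak before exponentiating for numerical stability.
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exp.values())
    ranked = sorted(((tok, e / total) for tok, e in exp.items()),
                    key=lambda item: item[1], reverse=True)
    # Top-k: keep only the k most probable tokens (0 disables the filter here).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalise so that the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}
```

With `top_k=1` only the single most probable token survives, so decoding is greedy regardless of temperature; lowering the temperature alone concentrates probability on the top token without removing the others.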

Buffered vs Streamed responses

Buffered Response

The /generate endpoint returns the entire text once it has finished generating. This is ideal for batched jobs or jobs without a real-time user-facing interface.

Python (Takeoff API Client)
Python (requests)
Javascript
cURL

from takeoff_client import TakeoffClient

client = TakeoffClient(base_url="http://localhost", port=3000)

generated_text = client.generate(
    'List 3 things to do in London.',
    sampling_temperature=0.1,
    no_repeat_ngram_size=3,
)
print(generated_text)

import requests

input_text = 'List 3 things to do in London.'

url = "http://localhost:3000/generate"
json = {"text":input_text}

response = requests.post(url, json=json)
print(response.json())

import axios from "axios";

let input_text = "List 3 things to do in London.";
let url = "http://localhost:3000/generate";
let json = { text: input_text };

axios({
  method: "post",
  url: url,
  data: json,
})
  .then(function (response) {
    console.log(response.data);
  })
  .catch(function (error) {
    console.log(error);
  });

curl -X POST http://localhost:3000/generate -H "Content-Type: application/json" -d '{"text": "List 3 things to do in London."}'

Streaming Response

Responses can be generated as a stream with the /generate_stream endpoint. Streaming responses are ideal for building interactive applications where users expect responses in real time, allowing them to see the answer progressively forming in front of them.

Python (Takeoff API Client)
Python (requests)
Javascript
cURL

from takeoff_client import TakeoffClient

client = TakeoffClient(base_url="http://localhost", port=3000)

# generate_stream returns a stream of server-sent events
generator = client.generate_stream(
    'List 3 things to do in London.',
    sampling_temperature=0.1,
    no_repeat_ngram_size=3,
)
for event in generator:
    print(event.data)

# import the requests library: to install, run `pip install requests`
import requests

input_text = 'List 3 things to do in London.'

url = "http://localhost:3000/generate_stream"
json = {"text":input_text}

# Send a POST request to the API
response = requests.post(url, json=json, stream=True)
response.encoding = 'utf-8'

# iterate over the response content
for text in response.iter_content(chunk_size=1, decode_unicode=True):
    if text:
        # print the responses as they come in
        print(text, end="", flush=True)

// import the axios library: to install, run `npm install axios`
// in browser, use <script src="https://unpkg.com/axios/dist/axios.min.js"></script>
// or add to your build process
import axios from "axios";

let input_text = "List 3 things to do in London.";
let url = "http://localhost:3000/generate_stream";
let json = { text: input_text };

axios({
  method: "post",
  url: url,
  data: json,
  responseType: "stream",
})
  .then(function (response) {
    response.data.on("data", (chunk) => {
      console.log(chunk.toString());
    });

    response.data.on("end", () => {
      console.log("Stream complete");
    });
  })
  .catch(function (error) {
    console.log(error);
  });

curl -X POST http://localhost:3000/generate_stream -N -H "Content-Type: application/json" -d '{"text":"List 3 things to do in London."}'

Batched Inference

A key performance gain in LLM deployment is ensuring that requests are batched together to be processed by the GPU (or other accelerator) in parallel. This can increase the throughput of your inference server dramatically. There are many natural strategies for doing so, each of which makes its own tradeoffs between throughput and latency.

The Takeoff server uses continuous batching for requests to generative models (where possible), and dynamic batching for requests to embedding models.

In continuous batching, the batch size can change during inference: incoming requests can join a batch that is already being processed, and each request is returned as soon as it completes. Continuous batching automatically selects the largest batch size that the hardware allows; an upper limit can be set with the TAKEOFF_MAX_BATCH_SIZE environment variable.
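The throughput benefit can be seen in a toy model that counts decode steps. The sketch below is not Takeoff's scheduler: it ignores prompt processing and memory limits, the function names are invented for the illustration, and `max_batch_size` simply plays the role of the upper limit described above. Each entry in `lengths` is the number of tokens (at least 1) that a request generates.

```python
from collections import deque


def static_batching_steps(lengths, max_batch_size):
    """Each batch runs until its longest request finishes before the next starts."""
    steps = 0
    for start in range(0, len(lengths), max_batch_size):
        steps += max(lengths[start:start + max_batch_size])
    return steps


def continuous_batching_steps(lengths, max_batch_size):
    """Waiting requests take over a slot as soon as a running request finishes."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before the next decode step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        steps += 1
        # One decode step for every running request; finished ones leave.
        running = [left - 1 for left in running if left > 1]
    return steps
```

For requests of lengths `[10, 2, 2, 2]` and a batch size of 2, the static scheme takes 12 steps because the second batch waits for the 10-token request, whereas the continuous scheme takes 10 steps because the short requests cycle through the second slot while the long one is still running.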

Sending requests asynchronously

All POST requests to /generate are batched continuously by default. If you send a single request, e.g.

import requests

input_text = 'List 3 things to do in London.'

url = "http://localhost:3000/generate"
json = {"text":input_text}

response = requests.post(url, json=json)

then no batching takes place unless other users are sending requests to the endpoint at the same time. The following example uses the Python asyncio library (with aiohttp) to send many single requests concurrently, so that they are batched together to improve throughput.

import asyncio
import time

import aiohttp  # to install, run `pip install aiohttp`

url = "http://localhost:3000/generate"


async def make_request(session, url):
    text = 'this is example text '
    # max_new_tokens caps the length of each generation (see the parameter table)
    items = {'text': text, 'max_new_tokens': 512, 'sampling_temperature': 1.0, 'sampling_topk': 10}
    async with session.post(url, json=items) as resp:
        return await resp.text()


async def main():
    async with aiohttp.ClientSession() as session:
        # Create all 200 requests up front so that they are sent concurrently
        tasks = [make_request(session, url) for _ in range(200)]
        return await asyncio.gather(*tasks)


start = time.time()

# Run the async event loop.
responses = asyncio.run(main())
end = time.time()

print(f"Time taken: {end-start}")
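Where asyncio is not convenient, a thread pool gives a similar effect and also bounds how many requests are open at once. This is a minimal sketch using only the standard library; `generate_concurrently`, `send_to_takeoff` and `max_in_flight` are illustrative names, and the URL assumes the same local server as the other examples on this page.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def send_to_takeoff(prompt, url="http://localhost:3000/generate"):
    """POST a single prompt and return the decoded JSON response."""
    body = json.dumps({"text": prompt}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def generate_concurrently(prompts, send=send_to_takeoff, max_in_flight=32):
    """Call send(prompt) for every prompt with at most max_in_flight open requests.

    Results are returned in the same order as `prompts`.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(send, prompts))
```

Because the requests overlap in time, the server can batch them together in the same way as in the asyncio example. Passing a different `send` callable makes the helper easy to exercise without a running server.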

Sending a single batch

The other way to trigger batching is to send a list of prompts as a single request. All of the prompts are appended to the batching queue at once and processed in batches; the response is returned once the entire request has been processed.

Here is an example of this:

import requests

input_text = [f'List {i} things to do in London.' for i in range(1, 21)]

url = "http://localhost:3000/generate"
json = {"text":input_text}

response = requests.post(url, json=json)

This will send a batch of 20 prompts to the endpoint, and will return once all 20 are finished processing, which may be after multiple batches depending on how Takeoff has been configured.
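Since a list request only returns once every prompt in it has finished, a very large list can be split on the client into several smaller requests so that results arrive incrementally. The helper below is a hypothetical client-side convenience, not a Takeoff feature, and a suitable `chunk_size` depends on your deployment.

```python
def chunk_prompts(prompts, chunk_size):
    """Split a list of prompts into consecutive lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [prompts[i:i + chunk_size] for i in range(0, len(prompts), chunk_size)]
```

Each chunk can then be posted as its own `{"text": chunk}` payload, in the same way as the single-batch request above.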
