Version: 0.13.x

Generation endpoints

To run inference with Takeoff, POST a JSON payload containing the prompt (text) and any generation parameters to the REST API endpoint. The endpoint returns a JSON response containing the generated text.

There are two generation endpoints, /generate and /generate_stream. /generate returns the entire response at once, whilst /generate_stream returns the response as a stream of server-sent events. Server-sent events are handled automatically by the Python client package, and are described in more detail here.

Both endpoints support batching, either through continuous batching or by submitting a user-compiled batch.

note

Takeoff also supports models for sequence embedding via the embed endpoint. See the docs for that endpoint here, or for classification & reranking models, see here.

Examples

You can interact with Takeoff through the REST API, the GUI, or our Python client.

# Ensure the 'takeoff_client' package is installed
# To install it, use the command: `pip install takeoff_client`
from takeoff_client import TakeoffClient

client = TakeoffClient(base_url="http://localhost", port=3000)

generator = client.generate_stream('List 3 things to do in London.',
                                   sampling_temperature=0.1,
                                   no_repeat_ngram_size=3)
for event in generator:
    print(event.data)

Generation Request Parameters

Generation parameters

Takeoff lets you shape your model's output via the following standard generation parameters.

| Parameter Name | Description | Default Value |
| --- | --- | --- |
| sampling_topk | Sample predictions from the top K most probable candidates | 1 |
| sampling_topp | Sample from set of tokens whose cumulative probability exceeds this value | 1.0 (no restriction) |
| sampling_temperature | Sample with randomness. Bigger temperatures are associated with more randomness and 'creativity'. | 1.0 |
| repetition_penalty | Penalise the generation of tokens that have been generated before. Set to > 1 to penalize. | 1 (no penalty) |
| no_repeat_ngram_size | Prevent repetitions of ngrams of this size. | 0 (turned off) |
| max_new_tokens | The maximum number of tokens that the model will generate in response to a prompt. | 128 |
| min_new_tokens | The minimum number of tokens that the model will generate in response to a prompt. | 1 |
| prompt_max_tokens | The maximum length (in tokens) for this prompt. Prompts longer than this value will be truncated. | None (truncation only to model context length) |
| regex_str | The regex string which generations will adhere to as they decode. | None |
| json_schema | The JSON Schema which generations will adhere to as they decode. Ignored if regex_str is set. | None |
| consumer_group | The consumer group to which to send the request. | 'primary' |
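
As a rough illustration of how these parameters travel in a REST request, the sketch below builds a JSON payload and rejects parameter names that are not listed above. The helpers `build_payload` and `post_generate` are hypothetical names introduced for this example and are not part of Takeoff or its client. The URL assumes a server on localhost port 3000, as in the other examples on this page, and the regex value is only a placeholder, since the supported regex syntax is not described here.

```python
import json
import urllib.request

# Parameter names copied from the table above; the helpers below are illustrative only.
KNOWN_PARAMETERS = {
    "sampling_topk", "sampling_topp", "sampling_temperature",
    "repetition_penalty", "no_repeat_ngram_size", "max_new_tokens",
    "min_new_tokens", "prompt_max_tokens", "regex_str", "json_schema",
    "consumer_group",
}


def build_payload(text, **params):
    """Build a request body, rejecting parameter names that are not documented."""
    unknown = set(params) - KNOWN_PARAMETERS
    if unknown:
        raise ValueError(f"Unknown generation parameters: {sorted(unknown)}")
    return {"text": text, **params}


def post_generate(payload, url="http://localhost:3000/generate"):
    """POST the payload as JSON and return the decoded JSON response (assumed URL)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_payload(
    "Pick a number between 100 and 999.",
    max_new_tokens=8,
    regex_str="[0-9]{3}",  # placeholder pattern for illustration
)
print(json.dumps(payload))
# With a Takeoff server running locally: result = post_generate(payload)
```

Validating names on the client side catches typos early; whether the server itself rejects or ignores unknown keys is not stated here.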

The sampling_topk, sampling_topp and sampling_temperature parameters are explained in detail here. Learn more about picking generation parameters for your task here.
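
To make the interaction between the three sampling parameters concrete, here is a minimal sketch of temperature scaling followed by top-k and top-p filtering over a toy list of logits. It is a generic illustration of these techniques and should not be read as Takeoff's internal implementation. The function names and the choice to apply top-k before top-p are assumptions made for this example.

```python
import math
import random


def filtered_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits: values below 1 sharpen the distribution.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [weight / total for weight in weights]

    # Rank tokens from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k keeps only the k most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # Top-p keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filtered_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
print(filtered_distribution(logits, top_k=1))             # only the most likely token survives
print(filtered_distribution(logits, top_k=4, top_p=0.8))  # the two most likely tokens survive
```

With `top_k=1` the distribution collapses to a single token, so sampling is deterministic regardless of temperature. Raising `top_k` is what lets the temperature and top-p settings have a visible effect.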

Buffered vs Streamed responses

Buffered Response

The /generate endpoint returns the entire text once it has finished generating. This is ideal for batched jobs or jobs without a real-time user-facing interface.

from takeoff_client import TakeoffClient

client = TakeoffClient(base_url="http://localhost", port=3000)

generated_text = client.generate('List 3 things to do in London.',
                                 sampling_temperature=0.1,
                                 no_repeat_ngram_size=3)
print(generated_text)

Streaming Response

Responses can be generated as a stream with the /generate_stream endpoint. Streaming responses are ideal for interactive applications where users expect responses in real time, as they let users see the answer form progressively in front of them.

from takeoff_client import TakeoffClient

client = TakeoffClient(base_url="http://localhost", port=3000)

generator = client.generate_stream('List 3 things to do in London.',
                                   sampling_temperature=0.1,
                                   no_repeat_ngram_size=3)
for event in generator:
    print(event.data)

# Returns a stream of server-sent events
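
If you call /generate_stream without the Python client, you need to decode the server-sent events yourself. The sketch below is a minimal parser for the standard server-sent events wire format: `data:` lines are collected and a blank line ends each event. The function name `iter_sse_data` is hypothetical, and the sample lines are made up for illustration, since the exact content of each Takeoff event is not described in this section.

```python
def iter_sse_data(lines):
    """Yield the data payload of each server-sent event from an iterable of text lines."""
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The format strips one optional space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event, id, retry) are ignored in this sketch.


# Made-up sample lines standing in for a real response stream.
sample_stream = ["data: London", "", ": keep-alive", "data: has", "data: parks", ""]
print(list(iter_sse_data(sample_stream)))
```

In practice the lines would come from a streamed HTTP response, read line by line as text. An event that is not followed by a blank line is dropped, which matches how the format treats an incomplete final event.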

Batched Inference

A key source of performance in LLM deployment is batching requests together so that the GPU (or other accelerator) processes them in parallel. This can increase the throughput of your inference server dramatically. There are many natural strategies for doing so, each of which makes its own tradeoffs between throughput and latency.

The Takeoff server uses continuous batching for requests to generative models (where possible), and dynamic batching for requests to embedding models.

In continuous batching, the batch size can change during inference: incoming requests can join a batch that is already being processed, and each request is returned as soon as it completes. Continuous batching automatically selects the largest batch size the hardware allows. You can set an upper limit on this with the TAKEOFF_MAX_BATCH_SIZE environment variable.
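
The effect of letting requests join and leave a running batch can be seen in a toy simulation. The sketch below counts decode steps for a fixed list of output lengths under static batching and under continuous batching. It is a simplified model written for this page, not Takeoff's scheduler: it assumes one token per sequence per step, and the `max_batch_size` argument only plays the role that TAKEOFF_MAX_BATCH_SIZE plays on the server.

```python
from collections import deque


def static_batching_steps(lengths, max_batch_size):
    """Decode steps when each batch must finish before the next one starts."""
    steps = 0
    for start in range(0, len(lengths), max_batch_size):
        # A static batch runs for as long as its longest sequence.
        steps += max(lengths[start:start + max_batch_size])
    return steps


def continuous_batching_steps(lengths, max_batch_size):
    """Decode steps when finished sequences free their slot immediately."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight sequence
    steps = 0
    while queue or active:
        # Fill any free slots from the queue before the next decode step.
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())
        steps += 1
        # Every active sequence emits one token; completed ones leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [8, 2, 2, 2, 8, 2]  # tokens each request needs to generate
print(static_batching_steps(lengths, max_batch_size=2))
print(continuous_batching_steps(lengths, max_batch_size=2))
```

For this workload the static schedule takes 18 steps and the continuous one takes 14, because short requests no longer wait for the longest sequence in their batch before their slot is reused.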

Sending requests asynchronously

All POST requests to /generate are batched continuously by default. If you send a single request, e.g.

import requests

input_text = 'List 3 things to do in London.'

url = "http://localhost:3000/generate"
json = {"text":input_text}

response = requests.post(url, json=json)

then no batching takes place unless several users are calling the endpoint at the same time. The following example uses the Python asyncio library to send many single requests concurrently, which lets the server batch them and improves throughput.

import asyncio
import time

import aiohttp

url = "http://localhost:3000/generate"


async def make_request(session, url):
    # Each coroutine sends one single-prompt request.
    text = 'this is example text '
    items = {'text': text, 'max_new_tokens': 512, 'sampling_temperature': 1.0, 'sampling_topk': 10}
    async with session.post(url, json=items) as resp:
        return await resp.text()


async def main():
    async with aiohttp.ClientSession() as session:
        # Launch 200 requests concurrently so the server can batch them.
        tasks = [make_request(session, url) for _ in range(200)]
        return await asyncio.gather(*tasks)


start = time.time()

# Run the async event loop.
responses = asyncio.run(main())
end = time.time()

print(f"Time taken: {end-start}")

Sending a single batch

The other way to trigger batching is to send a list of prompts in a single request. All of the prompts are added to the batching queue at once, processed in batches, and returned together once the entire request has been processed.

Here is an example of this:

import requests

# Build 20 distinct prompts: 'List 1 things ...' through 'List 20 things ...'
input_text = [f'List {i} things to do in London.' for i in range(1, 21)]

url = "http://localhost:3000/generate"
payload = {"text": input_text}

response = requests.post(url, json=payload)

This will send a batch of 20 prompts to the endpoint, and will return once all 20 are finished processing, which may be after multiple batches depending on how Takeoff has been configured.