Generation endpoints
To run inference with Takeoff, simply POST a JSON payload containing text (the prompt) and generation parameters to the REST API endpoint, which will return a JSON response containing the generated text.
There are two generation endpoints: /generate and /generate_stream. /generate returns the entire response at once, whilst /generate_stream returns the response as a stream of Server-Sent Events. Server-Sent Events are handled automatically by the Python client package, and are described in more detail here.
Both of these endpoints support batching, either via continuous batching or through submission of a user-compiled batch.
Takeoff also supports models for sequence embedding via the embed endpoint. See the docs for that endpoint here.
Examples
Takeoff can be used via the REST API, the GUI, or through our Python client.
Python (Takeoff Client):

# Ensure the 'takeoff_client' package is installed
# To install it, use the command: `pip install takeoff_client`
from takeoff_client import TakeoffClient
client = TakeoffClient(base_url="http://localhost", port=3000)
generator = client.generate_stream('List 3 things to do in London.',
                                   sampling_temperature=0.1,
                                   no_repeat_ngram_size=3)
for event in generator:
    print(event.data)
Python (requests):

import requests
input_text = 'List 3 things to do in London.'
url = "http://localhost:3000/generate_stream"
# add the generation parameters to the json payload
json = {
    "text": input_text,
    "sampling_temperature": 0.1,
    "no_repeat_ngram_size": 3
}
response = requests.post(url, json=json, stream=True)
response.encoding = 'utf-8'
for text in response.iter_content(chunk_size=1, decode_unicode=True):
    if text:
        print(text, end="", flush=True)
Javascript:

// import the axios library: to install, run `npm install axios`
// in browser, use <script src="https://unpkg.com/axios/dist/axios.min.js"></script>
// or add to your build process
import axios from "axios";
let input_text = "List 3 things to do in London.";
let url = "http://localhost:3000/generate_stream";
// add the generation parameters to the json payload
let json = {
  text: input_text,
  sampling_temperature: 0.1,
  no_repeat_ngram_size: 3,
};
axios({
  method: "post",
  url: url,
  data: json,
  responseType: "stream",
})
  .then(function (response) {
    response.data.on("data", (chunk) => {
      console.log(chunk.toString());
    });
    response.data.on("end", () => {
      console.log("Stream complete");
    });
  })
  .catch(function (error) {
    console.log(error);
  });
cURL:

curl -X POST "http://localhost:3000/generate_stream" -N -H "accept: application/json" -H "Content-Type: application/json" -d "{\"text\":\"List 3 things to do in London.\",\"sampling_temperature\":0.1,\"no_repeat_ngram_size\":3}"
Generation Request Parameters
Generation parameters
Takeoff lets you shape your model's output via the following standard generation parameters.
Parameter Name | Description | Default Value |
---|---|---|
sampling_topk | Sample predictions from the top K most probable candidates | 1 |
sampling_topp | Sample from the set of tokens whose cumulative probability exceeds this value | 1.0 (no restriction) |
sampling_temperature | Sample with randomness. Higher temperatures are associated with more randomness and 'creativity'. | 1.0 |
repetition_penalty | Penalise the generation of tokens that have been generated before. Set to > 1 to penalise. | 1 (no penalty) |
no_repeat_ngram_size | Prevent repetitions of ngrams of this size. | 0 (turned off) |
max_new_tokens | The maximum number of tokens that the model will generate in response to a prompt. | 128 |
min_new_tokens | The minimum number of tokens that the model will generate in response to a prompt. | 1 |
prompt_max_tokens | The maximum length (in tokens) for this prompt. Prompts longer than this value will be truncated. | None (truncation only to model context length) |
regex_str | The regex string which generations will adhere to as they decode. | None |
json_schema | The JSON Schema which generations will adhere to as they decode. Ignored if regex_str is set. | None |
consumer_group | The consumer group to which to send the request. | 'primary' |
The sampling_topk, sampling_topp and sampling_temperature parameters are explained in detail here. Learn more about picking generation parameters for your task here.
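As a quick illustration of how these parameters fit together, here is a sketch of a single /generate request that combines several of them in one JSON payload. The parameter names come from the table above; the values are arbitrary, and whether every combination (for example, sampling settings alongside a regex_str constraint) behaves identically may depend on your model and Takeoff version, so treat this as a starting point rather than a canonical recipe.
import requests

url = "http://localhost:3000/generate"
# Parameter names are taken from the table above; the values here are illustrative only.
payload = {
    "text": "When did the first person walk on the moon? Answer with a date.",
    "sampling_topk": 10,                 # sample from the 10 most probable tokens
    "sampling_topp": 0.9,                # nucleus sampling threshold
    "sampling_temperature": 0.7,         # moderate randomness
    "max_new_tokens": 64,                # cap the generation length
    "no_repeat_ngram_size": 3,           # block repeated 3-grams
    "regex_str": r"\d{4}-\d{2}-\d{2}",   # constrain decoding to a YYYY-MM-DD date
    "consumer_group": "primary",         # default consumer group
}
response = requests.post(url, json=payload)
print(response.json())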
Buffered vs Streamed responses
Buffered Response
The /generate endpoint returns the entire text once it has finished generating. This is ideal for batched jobs or jobs without a real-time user-facing interface.
Python (Takeoff API Client):

from takeoff_client import TakeoffClient
client = TakeoffClient(base_url="http://localhost", port=3000)
generated_text = client.generate('List 3 things to do in London.',
                                 sampling_temperature=0.1,
                                 no_repeat_ngram_size=3)
print(generated_text)
Python (requests):

import requests
input_text = 'List 3 things to do in London.'
url = "http://localhost:3000/generate"
json = {"text":input_text}
response = requests.post(url, json=json)
print(response.json())
Javascript:

import axios from "axios";
let input_text = "List 3 things to do in London.";
let url = "http://localhost:3000/generate";
let json = { text: input_text };
axios({
  method: "post",
  url: url,
  data: json,
})
  .then(function (response) {
    console.log(response.data);
  })
  .catch(function (error) {
    console.log(error);
  });
cURL:

curl -X POST http://localhost:3000/generate -H "Content-Type: application/json" -d '{"text": "List 3 things to do in London."}'
Streaming Response
Responses can be generated as a stream using the /generate_stream endpoint. Streaming responses are ideal for building interactive applications where users expect responses in real time, allowing them to see the answer progressively forming in front of them.
Python (Takeoff API Client):

from takeoff_client import TakeoffClient
client = TakeoffClient(base_url="http://localhost", port=3000)
# generate_stream returns a stream of server-sent events
generator = client.generate_stream('List 3 things to do in London.',
                                   sampling_temperature=0.1,
                                   no_repeat_ngram_size=3)
for event in generator:
    print(event.data)
Python (requests):

# import the requests library: to install, run `pip install requests`
import requests
input_text = 'List 3 things to do in London.'
url = "http://localhost:3000/generate_stream"
json = {"text":input_text}
# Send a POST request to the API
response = requests.post(url, json=json, stream=True)
response.encoding = 'utf-8'
# iterate over the response content
for text in response.iter_content(chunk_size=1, decode_unicode=True):
    if text:
        # print the responses as they come in
        print(text, end="", flush=True)
Javascript:

// import the axios library: to install, run `npm install axios`
// in browser, use <script src="https://unpkg.com/axios/dist/axios.min.js"></script>
// or add to your build process
import axios from "axios";
let input_text = "List 3 things to do in London.";
let url = "http://localhost:3000/generate_stream";
let json = { text: input_text };
axios({
  method: "post",
  url: url,
  data: json,
  responseType: "stream",
})
  .then(function (response) {
    response.data.on("data", (chunk) => {
      console.log(chunk.toString());
    });
    response.data.on("end", () => {
      console.log("Stream complete");
    });
  })
  .catch(function (error) {
    console.log(error);
  });
cURL:

curl -X POST http://localhost:3000/generate_stream -N -H "Content-Type: application/json" -d '{"text":"List 3 things to do in London"}'
Batched Inference
A key performance gain in LLM deployment is ensuring that requests are batched together to be processed by the GPU (or other accelerator) in parallel. This can increase the throughput of your inference server dramatically. There are many natural strategies for doing so, each of which makes its own tradeoffs between throughput and latency.
The Takeoff server uses continuous batching for requests to generative models (where possible), and dynamic batching for requests to embedding models.
In continuous batching, the batch size can change during inference, allowing incoming requests to join a batch that is already being processed and to be returned as they complete.
Continuous batching will automatically select the largest possible batch size given hardware constraints; however, an upper limit can be set by specifying the TAKEOFF_MAX_BATCH_SIZE environment variable.
Sending requests asynchronously
All POST requests to /generate are batched continuously by default. If you send a single request, e.g.
import requests
input_text = 'List 3 things to do in London.'
url = "http://localhost:3000/generate"
json = {"text":input_text}
response = requests.post(url, json=json)
then there won't be any batching unless multiple users are using the endpoint simultaneously. Here is an example using the Python asyncio library to send many single requests concurrently; the server can then batch them together, improving throughput.
# Requires `pip install aiohttp`
import asyncio
import time

import aiohttp

url = "http://localhost:3000/generate"

async def make_request(session, url):
    text = 'this is example text '
    items = {'text': text, 'max_new_tokens': 512, 'sampling_temperature': 1.0, 'sampling_topk': 10}
    async with session.post(url, json=items) as resp:
        return await resp.text()

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in range(200):
            tasks.append(make_request(session, url))
        responses = await asyncio.gather(*tasks)
        return responses

start = time.time()
# Run the async event loop.
asyncio.run(main())
end = time.time()
print(f"Time taken: {end - start}")
Sending a single batch
The other way to trigger batching is to send a list of prompts in a single request. These will all be appended to the batching queue at once, processed in batches, and returned to you once the entire request has been completed.
Here is an example of this:
import requests
input_text = [f'List {i} things to do in London.' for i in range(20)]
url = "http://localhost:3000/generate"
json = {"text":input_text}
response = requests.post(url, json=json)
This will send a batch of 20 prompts to the endpoint, and will return once all 20 have finished processing, which may take multiple batches depending on how Takeoff has been configured.
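If you want to pair each prompt with its generation, you can read the results back from the JSON response. The exact response layout may differ between Takeoff versions, so treat the sketch below as an assumption to verify against your own server's output: it assumes the generations come back as a list under the "text" key, in the same order as the prompts.
# Assumption: the batched response mirrors the request, returning a list of
# generations under "text" in prompt order. Inspect response.json() to confirm.
results = response.json()
for prompt, generation in zip(input_text, results.get("text", [])):
    print(f"{prompt!r} -> {generation!r}")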