Version: 0.20.x

Generate (Streamed)

POST /generate_stream

The /generate_stream endpoint sends input to the LLM and streams the response back token by token. If you want the response returned all at once, use the /generate endpoint instead.

The text field can be either a single string or an array of strings; pass an array to send a batch of inputs in one request. The server also supports dynamic batching, where requests that arrive within a short time interval are processed as a single batch.

The response is a stream of server-sent events, where each event carries one token generated by the LLM. For example, if you supply a batch of inputs:

{
"text": ["1 2 3 4", "a b c d"]
}

The data field of each server-sent event is a JSON payload with a text field containing the token and a batch_id field containing the index of the input, within the submitted batch, that the token belongs to:

data:{"text": "5", "batch_id": 0}

data:{"text": "e", "batch_id": 1}

data:{"text": "6", "batch_id": 0}

data:{"text": "f", "batch_id": 1}

The order in which tokens for the different batch entries are returned is not guaranteed.
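
Because tokens for different inputs can arrive interleaved, a client usually has to regroup them by batch_id. The following is a minimal sketch of that regrouping, not an official client: the helper names parse_sse_tokens and collect_by_batch are hypothetical, and the fallback to batch_id 0 when the field is absent is an assumption rather than documented behavior. It operates on plain lines so that it works with any HTTP client that exposes the stream line by line.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def parse_sse_tokens(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (batch_id, token) pairs from raw server-sent event lines."""
    for line in lines:
        line = line.strip()
        # Only "data:" lines carry a payload; skip blank separators and other fields.
        if not line.startswith("data:"):
            continue
        payload = json.loads(line[len("data:"):])
        # Assumption: default to 0 if batch_id is missing from the payload.
        yield payload.get("batch_id", 0), payload["text"]


def collect_by_batch(lines: Iterable[str]) -> Dict[int, str]:
    """Concatenate tokens per batch_id in the order they arrive."""
    outputs: Dict[int, str] = defaultdict(str)
    for batch_id, token in parse_sse_tokens(lines):
        outputs[batch_id] += token
    return dict(outputs)


# The event stream shown above, including the blank separator lines.
sample_stream = [
    'data:{"text": "5", "batch_id": 0}',
    "",
    'data:{"text": "e", "batch_id": 1}',
    "",
    'data:{"text": "6", "batch_id": 0}',
    "",
    'data:{"text": "f", "batch_id": 1}',
]
print(collect_by_batch(sample_stream))  # {0: '56', 1: 'ef'}
```

In a real client the lines would come from the HTTP response rather than a list, for example from a streaming request made with a library of your choice; the server address is deployment-specific and is not given here.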

Request

Content type: application/json

Body (required)

consumer_group: string, nullable

json_schema: nullable

max_new_tokens: int64, nullable

min_new_tokens: int64, nullable

no_repeat_ngram_size: int64, nullable

prompt_max_tokens: int64, nullable

regex_string: string, nullable

repetition_penalty: float, nullable

sampling_temperature: float, nullable

sampling_topk: int64, nullable

sampling_topp: float, nullable

text: object, required. oneOf two variants (MOD1, MOD2): a string or, per the description above, an array of strings.
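
As an illustration of how the body fields fit together, here is a hedged sketch that assembles a request body from the required text field and a few of the optional parameters listed above. The function build_request_body is a hypothetical helper, and the values 16 and 0.7 are arbitrary examples, not recommended or default settings. Leaving out unset fields, rather than sending explicit nulls, is a design choice of this sketch.

```python
import json
from typing import Any, Dict, List, Optional, Union


def build_request_body(
    text: Union[str, List[str]],
    max_new_tokens: Optional[int] = None,
    sampling_temperature: Optional[float] = None,
    sampling_topk: Optional[int] = None,
    sampling_topp: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a JSON-serializable body, omitting optional fields left as None."""
    optional = {
        "max_new_tokens": max_new_tokens,
        "sampling_temperature": sampling_temperature,
        "sampling_topk": sampling_topk,
        "sampling_topp": sampling_topp,
    }
    body: Dict[str, Any] = {"text": text}  # text is the only required field
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


# Illustrative values only.
body = build_request_body(
    ["1 2 3 4", "a b c d"],
    max_new_tokens=16,
    sampling_temperature=0.7,
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent as the application/json body of a POST to /generate_stream.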

Responses

Takes in a JSON payload and returns the response token by token, as a stream of server sent events.

Content type: application/json

Schema

text: object, required. oneOf two variants (MOD1, MOD2); MOD1 is string.

Example (from schema)

{
  "text": "string"
}
