Version: 0.16.x

Generate

POST /generate_vertex

Generates a full response. This endpoint is designed for use with Vertex AI: it is effectively a proxy for the /generate endpoint on the inference API, with some tweaks for compatibility.

The /generate endpoint is used to communicate with the LLM. Use this endpoint when you want to receive a full response from the LLM, all at once.

To send a batch of requests at once, set the text field to an array of strings instead of a single string. The server also supports dynamic batching, where requests that arrive within a short time interval are processed as a single batch.
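As a rough illustration of the two forms the text field can take, the sketch below builds a single-prompt payload and a batched payload and shows how either could be posted. The base URL, the helper names `build_payload` and `generate_vertex`, and the timeout are assumptions made for this example; only the `instances` and `text` fields and the /generate_vertex path come from the schema on this page.

```python
import json
import urllib.request
from typing import List, Union

# Hypothetical address: replace with wherever your server is listening.
BASE_URL = "http://localhost:3000"


def build_payload(text: Union[str, List[str]]) -> dict:
    """Wrap one prompt (str) or a batch of prompts (list of str) in `instances`."""
    return {"instances": [{"text": text}]}


def generate_vertex(text: Union[str, List[str]],
                    base_url: str = BASE_URL,
                    timeout: float = 60.0) -> dict:
    """POST the payload to /generate_vertex and return the decoded JSON body."""
    body = json.dumps(build_payload(text)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate_vertex",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# A single prompt: `text` is a plain string.
single = build_payload("What is the capital of France?")

# A batch: `text` is an array of strings, sent in one request.
batch = build_payload(["First prompt", "Second prompt"])
```

Calling `generate_vertex(...)` requires a running server, so the sketch only constructs the payloads at import time.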

Request

application/json

Body

required

instances

object[]

required

Array [

consumer_group string nullable

json_schema nullable

max_new_tokens int64 nullable

min_new_tokens int64 nullable

no_repeat_ngram_size int64 nullable

prompt_max_tokens int64 nullable

regex_string string nullable

repetition_penalty float nullable

sampling_temperature float nullable

sampling_topk int64 nullable

sampling_topp float nullable

text

object

required

oneOf

string (a single prompt)

string[] (a batch of prompts)

]
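The request schema above could be assembled along the following lines. This is a minimal sketch: the `Instance` dataclass and `build_request` helper are hypothetical names, only a subset of the optional fields is modelled, and the parameter values are purely illustrative. It also assumes that leaving a nullable field out of the payload is acceptable to the server, which this page does not state explicitly.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional, Union


@dataclass
class Instance:
    """One entry of the `instances` array; only `text` is required."""
    text: Union[str, List[str]]
    max_new_tokens: Optional[int] = None
    min_new_tokens: Optional[int] = None
    sampling_temperature: Optional[float] = None
    sampling_topk: Optional[int] = None
    sampling_topp: Optional[float] = None
    repetition_penalty: Optional[float] = None
    regex_string: Optional[str] = None

    def to_json(self) -> dict:
        # Assumption: unset (None) fields are omitted rather than sent as null.
        return {key: value for key, value in asdict(self).items() if value is not None}


def build_request(instances: List[Instance]) -> dict:
    """Build the JSON body expected under the top-level `instances` key."""
    return {"instances": [instance.to_json() for instance in instances]}


request_body = build_request([
    Instance(
        text="Write a haiku about rain.",
        max_new_tokens=64,          # illustrative value
        sampling_temperature=0.7,   # illustrative value
    )
])
```

Keeping the optional fields in one typed structure makes it harder to misspell a parameter name such as `sampling_topk` when building requests by hand.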

Responses

Takes in a JSON payload and returns the response all at once.

application/json

Schema

predictions

object[]

required

Array [

text

object

required

oneOf

string

]

Example (from schema)

{
  "predictions": [
    {
      "text": "string"
    }
  ]
}
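A response of this shape could be unpacked as in the sketch below. The `extract_texts` helper is a hypothetical name, and the handling of an array-valued `text` assumes the response mirrors the request's string-or-array form, which the schema's oneOf suggests but this page does not spell out.

```python
from typing import Any, Dict, List


def extract_texts(response: Dict[str, Any]) -> List[str]:
    """Flatten the `predictions` array into a list of generated strings."""
    texts: List[str] = []
    for prediction in response["predictions"]:
        text = prediction["text"]
        if isinstance(text, str):
            # Single generation for this prediction.
            texts.append(text)
        else:
            # Assumed batch form: one generation per prompt in the batch.
            texts.extend(text)
    return texts


# The example body from the schema above.
example_response = {"predictions": [{"text": "string"}]}
print(extract_texts(example_response))
```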
