JSON/Regex Structured Inference
Takeoff allows you to generate outputs which match a given regular expression or JSON Schema. This lets you output data with a consistent format, adhering to any type requirements.
Using Regex
To constrain the output to follow a given regex, use the regex_string parameter. An invalid regex string raises an error, which is reported in the terminal.
Example Regex Command & Outputs
curl -X 'POST' \
'http://localhost:3000/generate' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
  "consumer_group": "primary",
  "max_new_tokens": 300,
  "no_repeat_ngram_size": 3,
  "regex_string": "[+-]?([0-9]*[.])?[0-9]+",
  "repetition_penalty": 1.2,
  "sampling_temperature": 0.9,
  "sampling_topk": 10,
  "sampling_topp": 0.9,
  "text": "The value of pi is "
}'
Outputs:
Without Regex | With Regex |
---|---|
$$\pi = \frac{22 | 3.14159265 |
Regex support is provided by interegular, which supports most (but not all) of the regex specification. See the interegular docs.
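As a quick local sanity check, the pattern from the example can be tried against candidate outputs with Python's built-in re module. This is only a sketch: Takeoff uses interegular rather than re, so a pattern that works here is not guaranteed to be accepted by Takeoff, and the helper name matches_pattern is hypothetical.

```python
import re

# Pattern from the example above: an optionally signed integer or decimal.
PATTERN = r"[+-]?([0-9]*[.])?[0-9]+"


def matches_pattern(text: str, pattern: str = PATTERN) -> bool:
    """Return True if the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


# The two outputs shown in the table above.
print(matches_pattern("3.14159265"))
print(matches_pattern(r"$$\pi = \frac{22"))
```

Using fullmatch rather than search mirrors the intent of constrained generation, where the entire output (not just a substring) should follow the pattern.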
Using JSON
- To generate JSON output, first define the required structure of your output as a JSON Schema. An easy way to do this is to use pydantic and its built-in converter.
- Pass the resulting schema as JSON using the json_schema parameter.
Example Pydantic Model, JSON Schema & Output
This example is designed to extract information about Zagreb, Croatia. The Pydantic model, the JSON Schema generated from it, and the output JSON are shown in turn.
from typing import List

from pydantic import BaseModel, Field


class Country(BaseModel):
    name: str


class Schema(BaseModel):
    # Aliases become the JSON keys, so they can carry extra context for the model.
    country: Country
    dasl: int = Field(alias='M above sea level')
    city_districts: List[str] = Field(alias='Example districts in city')
    population: int = Field(alias='Total Population of city')
    dimensions: List[int] = Field(alias='Dimensions of city in miles')
    mayor: str = Field(alias='First mayor')


# schema_json() is the Pydantic v1 call, matching the schema shown here.
print(Schema.schema_json())
{
  "title": "Schema",
  "type": "object",
  "properties": {
    "country": {
      "$ref": "#/definitions/Country"
    },
    "M above sea level": {
      "title": "M Above Sea Level",
      "type": "integer"
    },
    "Example districts in city": {
      "title": "Example Districts In City",
      "type": "array",
      "items": {
        "type": "string"
      }
    },
    "Total Population of city": {
      "title": "Total Population Of City",
      "type": "integer"
    },
    "Dimensions of city in miles": {
      "title": "Dimensions Of City In Miles",
      "type": "array",
      "items": {
        "type": "integer"
      }
    },
    "First mayor": {
      "title": "First Mayor",
      "type": "string"
    }
  },
  "required": [
    "country",
    "M above sea level",
    "Example districts in city",
    "Total Population of city",
    "Dimensions of city in miles",
    "First mayor"
  ],
  "definitions": {
    "Country": {
      "title": "Country",
      "type": "object",
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        }
      },
      "required": [
        "name"
      ]
    }
  }
}
Prompting for extraction with the introduction of the Wikipedia page for Zagreb, Croatia yields:
{
  "Total Population of city": 767131,
  "Example districts in city": [
    "Podsljeme",
    "Sesvete"
  ],
  "Dimensions of city in miles": [
    19,
    12
  ],
  "M above sea level": 158,
  "country": {
    "name": "Croatia"
  },
  "First mayor": "Janko Kamauf"
}
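Before using a generated object downstream, it can be worth confirming that every required key is present. The following is a minimal sketch using only the standard library; REQUIRED_KEYS is copied from the "required" list of the schema above, and the helper name missing_keys is hypothetical rather than part of Takeoff.

```python
import json

# Copied from the "required" list of the generated JSON Schema above.
REQUIRED_KEYS = [
    "country",
    "M above sea level",
    "Example districts in city",
    "Total Population of city",
    "Dimensions of city in miles",
    "First mayor",
]


def missing_keys(raw_output: str, required=REQUIRED_KEYS) -> list:
    """Parse generated text as JSON and list any required top-level keys it lacks."""
    data = json.loads(raw_output)
    return [key for key in required if key not in data]


# The output shown above, as it might arrive in a response string.
example_output = """{
  "Total Population of city": 767131,
  "Example districts in city": ["Podsljeme", "Sesvete"],
  "Dimensions of city in miles": [19, 12],
  "M above sea level": 158,
  "country": {"name": "Croatia"},
  "First mayor": "Janko Kamauf"
}"""

print(missing_keys(example_output))
```

This only checks top-level key presence, not value types or nested objects; parsing the output back into the Pydantic model would give a fuller check.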
A request with a JSON Schema can be made with the Python Takeoff API client, with Python requests, or with cURL, shown in turn.
# Ensure the 'takeoff_client' package is installed
# To install it, use the command: `pip install takeoff_client`
from takeoff_client import TakeoffClient
from typing import List
from pydantic import BaseModel, Field


class Country(BaseModel):
    name: str


class Schema(BaseModel):
    country: Country
    dasl: int = Field(alias='M above sea level')
    city_districts: List[str] = Field(alias='Example districts in city')
    population: int = Field(alias='Total Population of city')
    dimensions: List[int] = Field(alias='Dimensions of city in miles')
    mayor: str = Field(alias='First mayor')


client = TakeoffClient(base_url="http://localhost", port=3000)

input_text = """<Information about Zagreb>
Extract the required information about Zagreb."""

# Pass the schema as a JSON string via the json_schema parameter.
generated_text = client.generate(
    input_text,
    json_schema=Schema.schema_json(),
    sampling_temperature=0.1,
    no_repeat_ngram_size=3,
)
print(generated_text)
import requests
from typing import List
from pydantic import BaseModel, Field


class Country(BaseModel):
    name: str


class Schema(BaseModel):
    country: Country
    dasl: int = Field(alias='M above sea level')
    city_districts: List[str] = Field(alias='Example districts in city')
    population: int = Field(alias='Total Population of city')
    dimensions: List[int] = Field(alias='Dimensions of city in miles')
    mayor: str = Field(alias='First mayor')


input_text = """<Information about Zagreb>
Extract the required information about Zagreb."""

url = "http://localhost:3000/generate"
# Named `payload` rather than `json` to avoid confusion with the json module.
payload = {"text": input_text, "json_schema": Schema.schema_json()}
response = requests.post(url, json=payload)
print(response.json())
curl -X 'POST' \
'http://localhost:3000/generate' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
  "consumer_group": "primary",
  "max_new_tokens": 300,
  "no_repeat_ngram_size": 3,
  "json_schema": {
    "type": "object",
    "properties": {
      "activity": {
        "type": "string"
      }
    },
    "required": [
      "activity"
    ]
  },
  "repetition_penalty": 1.2,
  "sampling_temperature": 0.9,
  "sampling_topk": 10,
  "sampling_topp": 0.9,
  "text": "What can a person do at a lake?"
}'
- When using the GUI, only syntactically valid JSON can be submitted. The GUI does not check whether the input is a valid JSON Schema. You can use a tool like this to check, but currently the most reliable way to ensure success is to create the schema via pydantic.
Neither Takeoff nor Pydantic yet supports the entire JSON Schema specification. The majority of features expressible in Pydantic are available for use with Takeoff, but key exceptions include:
- Field attributes other than min_length/max_length.
- Tuples - Use lists with bounded lengths instead.
See more on the differences between the Pydantic specification and JSON Schema here.
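To illustrate the tuple workaround, a fixed-size pair can be approximated by a list whose length is bounded on both sides. The sketch below writes the JSON Schema fragment directly as a Python dict; the minItems/maxItems keywords are standard JSON Schema, but the fragment itself, the name dimensions_schema, and the helper within_bounds are hypothetical, and it is an assumption that this is the form Takeoff receives for a bounded list.

```python
import json

# Hypothetical fragment: a two-element integer list standing in for a tuple.
dimensions_schema = {
    "type": "array",
    "items": {"type": "integer"},
    "minItems": 2,
    "maxItems": 2,
}


def within_bounds(values: list, schema: dict) -> bool:
    """Check a list's length against the schema's minItems/maxItems bounds."""
    low = schema.get("minItems", 0)
    high = schema.get("maxItems", len(values))
    return low <= len(values) <= high


print(json.dumps(dimensions_schema))
print(within_bounds([19, 12], dimensions_schema))
```

A bounded list loses the per-position typing of a tuple, so all items must share one type.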
Tips
- As the JSON schema's keys will be the tokens immediately prior to any generated tokens, you can add extra context into the keys. This is particularly useful for specifying things like units, e.g. "Height in CM".
- Fields which aren't marked as required are more often than not ignored, even when they likely should be generated. If a field is missing, it's probably because it hasn't been marked as required.
- For performance reasons, the order of keys generated from a JSON schema cannot be fixed, and should instead be rectified in post-processing.
- JSON and Regex cannot currently be used together. If values are submitted for both, the Regex one is used and a warning is issued in the terminal.
- Performance can be highly sensitive to the tokens at the end of the prompt. Particularly, if your results are not as expected, try adding/removing a space or newline (or both) at the end of your prompt.
- Performance could be improved by prompting the model to adhere to the format - adding something like "Output the answer as a list of 5 items" or "Output the values for each required item" may work.
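The key-ordering tip above can be handled with a small post-processing step. This is a minimal sketch, assuming the output has already been parsed into a dict and that the schema's "properties" order is the order you want; the function name reorder_keys is hypothetical and only top-level keys are reordered.

```python
def reorder_keys(output: dict, schema: dict) -> dict:
    """Return a copy of `output` with top-level keys in the schema's property order."""
    ordered = {
        key: output[key]
        for key in schema.get("properties", {})
        if key in output
    }
    # Keep any keys the schema does not name at the end, in their original order.
    ordered.update({k: v for k, v in output.items() if k not in ordered})
    return ordered


# Illustrative schema and output, not taken from a real Takeoff response.
schema = {"properties": {"country": {}, "First mayor": {}}}
output = {"First mayor": "Janko Kamauf", "country": {"name": "Croatia"}}

print(list(reorder_keys(output, schema)))
```

This relies on Python dicts preserving insertion order, which is guaranteed from Python 3.7 onwards.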
Supported Models & Backends
Structured inference is supported on all causal models (BART, for example, is not a causal model and is not supported).