
Chat Template Endpoint


What are chat templates?

Instruction-tuned models are base language models that have been trained to respond to user instructions, and are often favoured in downstream applications because it is easier to prompt them to do what you want.

Instruction-tuned models are often tuned using an instruction template: a structured prompt used at training time to label the different parts of a conversation, making it clear which parts of a complex prompt come from the user, the assistant, the system, and so on. Here is an example of the chat template used for Llama-3 Instruct variants:

<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>

Your system prompt goes here.<|eot_id|>

<|start_header_id|>user<|end_header_id|>

Your user message goes here<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>
# The assistant responds
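To make the layout above concrete, here is a minimal hand-written sketch that assembles a Llama-3 style prompt from a list of messages. It is an approximation for illustration, not the template Takeoff itself applies: the function name render_llama3_prompt is hypothetical, and the exact whitespace (two newlines after each header, no separator between turns) is an assumption. The Jinja template in the model repo remains authoritative.

```python
BOS = "<|begin_of_text|>"


def render_llama3_prompt(messages, add_generation_prompt=True):
    """Hand-written approximation of the Llama-3 Instruct prompt layout."""
    parts = [BOS]
    for message in messages:
        # Each turn: a role header, a blank line, the content, then an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn and leave it unclosed for the model to complete.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Your system prompt goes here."},
    {"role": "user", "content": "Your user message goes here"},
]
prompt = render_llama3_prompt(messages)
print(prompt)
```

Setting add_generation_prompt to False in this sketch produces a prompt that ends at the last end-of-turn token, with no open assistant header.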

Where can I find a model's chat template?

Model chat templates are now usually stored alongside the tokenizer information in the model's Hugging Face repo, as Jinja templates. The chat template used above can be found in the tokenizer_config.json file in the TitanML model repo TitanML/Meta-Llama-3-8B-Instruct-AWQ-4bit, in a field called chat_template.
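As a rough illustration of reading that field, the sketch below loads a tokenizer_config.json from disk and returns its chat_template entry. The helper name load_chat_template is hypothetical, the file written here is a tiny stand-in so the example runs offline, and the sketch assumes the template is stored as a single string.

```python
import json
import tempfile
from pathlib import Path


def load_chat_template(config_path):
    """Return the `chat_template` string from a tokenizer_config.json, or None if absent."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return config.get("chat_template")


# Stand-in file so the sketch runs offline. A real tokenizer_config.json is
# downloaded from the model repo and contains many more fields.
toy_config = {"chat_template": "{% for m in messages %}{{ m['content'] }}{% endfor %}"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tokenizer_config.json"
    path.write_text(json.dumps(toy_config), encoding="utf-8")
    template = load_chat_template(path)

print(template)
```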

Chat Template Endpoint

The Takeoff chat template endpoint is a quick way to format a series of messages into the correct prompt template for instruction-tuned models. The endpoint takes three parameters: inputs, add_generation_prompt, and reader_id.

Reader ID

The reader ID is a unique identifier for a reader configuration in the Takeoff manifest. It specifies and distinguishes between different reader setups, which may include model selection, device allocation, and other configuration parameters. You can find the reader ID of a model by looking at the manifest file or by using the Management API. You specify the reader_id in the endpoint route: to get the chat template for reader1, send your request to localhost:3000/chat_template/reader1.

In the following manifest.yaml file, the reader ID is reader1:

takeoff:
  server_config: # Shared across readers

  readers_config:
    reader1: # <<< The reader id is here
      model_name: "TitanML/Mistral-7B-Instruct-v0.2-AWQ-4bit"
      device: "cuda"
      consumer_group: "primary"
      max_sequence_length: 1024
      max_batch_size: 64
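The relationship between the manifest and the endpoint route can be sketched as follows. The manifest is written here as a plain Python dictionary mirroring the YAML above, so that no YAML parser is needed; loading the real file would need one, for example PyYAML. The helpers list_reader_ids and chat_template_url are hypothetical names used only for illustration, and the http scheme is an assumption.

```python
def list_reader_ids(manifest):
    """Return the reader IDs declared under takeoff.readers_config."""
    takeoff = manifest.get("takeoff") or {}
    return list(takeoff.get("readers_config") or {})


def chat_template_url(reader_id, host="localhost", port=3000):
    """Build the chat template route for one reader (http scheme assumed)."""
    return f"http://{host}:{port}/chat_template/{reader_id}"


# Dictionary equivalent of the manifest.yaml shown above.
manifest = {
    "takeoff": {
        "server_config": None,  # empty in the example manifest
        "readers_config": {
            "reader1": {
                "model_name": "TitanML/Mistral-7B-Instruct-v0.2-AWQ-4bit",
                "device": "cuda",
                "consumer_group": "primary",
                "max_sequence_length": 1024,
                "max_batch_size": 64,
            }
        },
    }
}

for reader_id in list_reader_ids(manifest):
    print(reader_id, "->", chat_template_url(reader_id))
```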

Inputs

The endpoint takes a chat history as input: a series of messages exchanged between the participants in a conversation. Each message has a role field and a content field. The following example message history would generate the Llama-3 prompt shown earlier:

[
    {"role": "system", "content": "Your system prompt goes here"},
    {"role": "user", "content": "Your user message goes here"}
]

This message history has a system message and a user message.

Possible Roles

Not all models support every role. For example, Mistral Instruct models only accept the user and assistant roles. If you send an unsupported role, the endpoint will return a 500 error.
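Because an unsupported role only surfaces as a 500 error, it can be convenient to check a message history on the client side before sending it. The sketch below is one possible check, not part of the Takeoff client: the helper unsupported_roles is hypothetical, and the allowed role set for Mistral Instruct is taken from the statement above rather than read from the model's template.

```python
def unsupported_roles(messages, allowed_roles):
    """Return roles used in `messages` that are not allowed, in order of first appearance."""
    rejected = []
    for message in messages:
        role = message["role"]
        if role not in allowed_roles and role not in rejected:
            rejected.append(role)
    return rejected


# Assumption taken from the text: Mistral Instruct accepts only these roles.
MISTRAL_ROLES = {"user", "assistant"}

chat = [
    {"role": "system", "content": "Your system prompt goes here"},
    {"role": "user", "content": "Your user message goes here"},
]

print(unsupported_roles(chat, MISTRAL_ROLES))  # prints ['system']
```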

Add Generation Prompt

The add_generation_prompt field specifies whether to prompt the model to continue with the next message: the template opens an assistant message but does not close it. Set it to True if you expect the model to fill in the assistant section.

Examples

The examples below show how to use the endpoint, assuming there is a model with reader ID reader1:

# Ensure the 'takeoff_client' package is installed
# To install it, use the command: `pip install takeoff_client`
from takeoff_client import TakeoffClient

client = TakeoffClient(base_url="http://localhost", port=3000)

template = client.chat_template(
    inputs=[[
        {"role": "system", "content": "Your system prompt goes here"},
        {"role": "user", "content": "Your user message goes here"},
    ]],
    reader_id="reader1",
    add_generation_prompt=True,
)
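For readers who prefer not to use the client library, the same call can be sketched with the Python standard library. Several details here are assumptions rather than documented behaviour: that the route accepts a POST request, that the body is JSON with fields named inputs and add_generation_prompt, and that the server listens on http://localhost:3000. The helper name build_chat_template_request is hypothetical. The request is only constructed, since sending it requires a running Takeoff server.

```python
import json
import urllib.request


def build_chat_template_request(reader_id, inputs, add_generation_prompt=True,
                                base_url="http://localhost:3000"):
    """Build (but do not send) a request to the chat template route.

    Assumes a POST with a JSON body; check the API reference before relying on this.
    """
    body = json.dumps({
        "inputs": inputs,
        "add_generation_prompt": add_generation_prompt,
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat_template/{reader_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_template_request(
    "reader1",
    inputs=[[
        {"role": "system", "content": "Your system prompt goes here"},
        {"role": "user", "content": "Your user message goes here"},
    ]],
)
print(request.get_method(), request.full_url)

# Sending needs a live server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```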