Version: 0.13.x

Classification endpoints

Using classification & reranking models with Takeoff requires making a request to the classification endpoint: /classify.

The /classify endpoint supports two types of tasks/models:

  • Classification tasks - classify an input text by assigning a per-class output score based on a model-defined labelling scheme. An example is sentiment analysis, which scores a text against a set of model-defined sentiment labels. E.g. cardiffnlp/twitter-roberta-base-sentiment-latest returns three scores as a vector: [negative score, neutral score, positive score].
  • Reranking tasks - embed a query and a text together and use the full per-token representations to score the text's relevance to the query. These are particularly useful in RAG applications, or for models which compare a text with another text or with a provided label.

The endpoint takes only a consumer_group (defaulting to primary) and the text to be classified, and returns a JSON response whose format depends on the type of model used for inference.

To use a classification or reranking model, you can pass in any Hugging Face model that has a ForSequenceClassification architecture, such as BAAI/bge-reranker-base.
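As a minimal sketch of the request shape, using only the standard library (the URL and port are assumptions based on the client example further down; adjust them to your deployment):

```python
import json
import urllib.request

# Assumed endpoint for a locally running Takeoff server.
TAKEOFF_URL = "http://localhost:3000/classify"

def classify(text, consumer_group="primary"):
    """POST text (a string, list of strings, or list of [text, text] pairs)
    to the /classify endpoint and return the "result" field of the response."""
    body = json.dumps({"text": text, "consumer_group": consumer_group}).encode()
    req = urllib.request.Request(
        TAKEOFF_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]
```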


These are the docs for interfacing with classification & reranking models. If you wish to interface with generative models, see the docs here, or for embedding models, see here.

Input formats

The /classify endpoint accepts a range of input formats, distinguishing between single text inputs (e.g. a text for sentiment analysis) and concatenated inputs (e.g. a query and a document whose relevance to the query will be returned).

More on input formats & concatenations

Individual texts - strings or lists of strings

An individual string can be passed, or a list of strings can be passed for batch inference. This is usually done with models that have set buckets/labels with which to classify, such as in sentiment analysis tasks.
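For instance, both of these are valid payload shapes for individual texts (a sketch; the texts themselves are placeholders):

```python
# A single text for classification:
single = {"text": "The film was wonderful."}

# A list of strings: a batch of independent texts classified in one request.
batch = {"text": ["The film was wonderful.", "The service was terrible."]}
```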

Concatenations - lists of lists of strings

Two texts can be concatenated as part of the classify endpoint's preprocessing, triggered by passing a list of two texts. Concatenation is performed by joining the two texts on either side of a model-specific [SEP] token.

Why shouldn't I concatenate the sentences myself?

Some models (for example BERT) are trained with a special [SEP] token that distinguishes two different parts of an input. For example, this might be used to separate questions and answers, or English sentences and their French translations. The model, having been trained with this token, performs better on these two-sentence tasks. Concatenation has to be done server side so as to access the tokenizer's [SEP] token.

Concatenation Example:

text = [['query1', 'doc1'], ['query1', 'doc2'], ['query1', 'doc3']]

becomes, server side, a batch of text like this:

['query1 [SEP] doc1', 'query1 [SEP] doc2', 'query1 [SEP] doc3']

This then yields three sets of logits, one for each of the concatenated elements. For reranking models, these elements are typically a (monotonic) representation of similarity.

A list of lists triggers a batch of concatenation inferences.
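The preprocessing above can be sketched as follows (an illustration of the behaviour, not the actual server code; the literal "[SEP]" stands in for the model-specific separator token):

```python
SEP = "[SEP]"  # stand-in for the model-specific separator token

def concatenate_pairs(pairs):
    """Join each [text_a, text_b] pair around the separator, as the server does."""
    return [f"{a} {SEP} {b}" for a, b in pairs]

text = [["query1", "doc1"], ["query1", "doc2"], ["query1", "doc3"]]
batch = concatenate_pairs(text)
# batch == ['query1 [SEP] doc1', 'query1 [SEP] doc2', 'query1 [SEP] doc3']
```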

Reranking vs Classification

Reranking Example:

Here is an example of using the /classify endpoint with a reranking model:

query = "What colour is the dog?"
doc1 = "The puppy is light blue."
doc2 = "Feet itching is a symptom of athletes foot, which is caused by a fungus."
doc3 = "Space is very big."

all_docs = [doc1, doc2, doc3]

payload = {"text": [[query, doc] for doc in all_docs]}

# ... send to takeoff ...

# Response:
# {"result":[[1.1953125],[-10.1875],[-10.1875]]}

It is apparent that doc1, which answers the query, has a much higher score than the irrelevant documents.
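To recover a ranking from the response, you can simply sort the documents by their scores (the values below are taken from the response above):

```python
# One score per [query, doc] pair, in the order the pairs were sent.
scores = [1.1953125, -10.1875, -10.1875]
docs = [
    "The puppy is light blue.",
    "Feet itching is a symptom of athletes foot, which is caused by a fungus.",
    "Space is very big.",
]

# Sort documents by score, highest (most relevant) first.
ranked = [doc for _, doc in sorted(zip(scores, docs), reverse=True)]
# ranked[0] == "The puppy is light blue."
```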

Zero-shot Classification Example:

You can use the concatenation feature to create zero-shot classification pipelines. One way to perform zero-shot classification is to use a model trained on Natural Language Inference (NLI). If you want to classify text into one of three buckets, you can construct premise and hypothesis strings, where the premise is the text, and the hypothesis is a sentence like this text is about <x> where x is your potential label.

A model trained on NLI (such as microsoft/deberta-large-mnli) will return one of three entailment labels, with a premise assigned to the 'entail' label if it is most likely that what is in the hypothesis is entailed by the premise.

This model returns three logits, one for each of the following labels:

"0": "CONTRADICTION",
"1": "NEUTRAL",
"2": "ENTAILMENT"

premise = "one day I will see the world."
hyp_1 = "This sentence is about travel."
hyp_2 = "This sentence is about cooking."
hyp_3 = "This sentence is about dancing."

all_hyps = [hyp_1, hyp_2, hyp_3]

# premise and hypothesis are concatenated server-side
payload = {"text": [[premise, hyp] for hyp in all_hyps]}
# payload: {'text': [['one day I will see the world.', 'This sentence is about travel.'], ['one day I will see the world.', 'This sentence is about cooking.'], ['one day I will see the world.', 'This sentence is about dancing.']]}
# ... send to takeoff ...

# Response:

The Entailment score on the first example is the largest, so you might choose to classify this as a sentence about travel.
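The label-picking step can be sketched like this (the logit triples are hypothetical stand-ins for a real response, ordered [contradiction, neutral, entailment] per the model's label scheme):

```python
# Hypothetical logits: one [contradiction, neutral, entailment] triple
# per hypothesis, in the order the pairs were sent.
logits = [[-1.2, 0.3, 2.5], [1.8, 0.1, -2.0], [1.5, 0.2, -1.7]]
labels = ["travel", "cooking", "dancing"]

# Take the entailment logit for each hypothesis and pick the largest.
entailment_scores = [row[2] for row in logits]
best = labels[entailment_scores.index(max(entailment_scores))]
# best == "travel"
```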

Interfacing Examples

# Ensure the 'takeoff_client' package is installed
# To install it, use the command: `pip install takeoff_client`
from takeoff_client import TakeoffClient

client = TakeoffClient(base_url="http://localhost", port=3000)
input_text = [
    ['NASAs Hubble Traces String of Pearls Star Clusters in Galaxy Collisions', 'This is a sentence about science.'],
    ['NASAs Hubble Traces String of Pearls Star Clusters in Galaxy Collisions', 'This is a sentence about sport.'],
]

response = client.classify(input_text, consumer_group='primary')

Batching requests

Classification models use dynamic batching. In dynamic batching, the maximum batch size is fixed, and incoming requests are buffered until a full batch is waiting or a timeout is reached, allowing for optimal hardware utilisation. For more information - including how to choose a suitable timeout value - see our conceptual guide to batching.
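The buffering behaviour can be sketched as follows (an illustration of dynamic batching in general, not Takeoff's implementation):

```python
import queue
import time

def collect_batch(requests_q, max_batch_size, timeout_s):
    """Buffer requests until the batch is full or the timeout expires."""
    batch = [requests_q.get()]  # block until the first request arrives
    deadline = time.monotonic() + timeout_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timeout reached: dispatch a partial batch
        try:
            batch.append(requests_q.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived in time
    return batch
```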

The timeout and max batch size can be configured by setting the TAKEOFF_BATCH_DURATION_MILLIS and TAKEOFF_MAX_BATCH_SIZE environment variables:

# Timeout of 100ms and max batch size of 32
docker run -it --gpus all -e TAKEOFF_BATCH_DURATION_MILLIS=100 -e TAKEOFF_MAX_BATCH_SIZE=32...