Version: 0.12.x

Image To Text Generation

Takeoff supports image-to-text generation using multimodal generative models such as Llava. These models are trained to describe and interact with images provided to them, and can be used for question answering over images, OCR, and image captioning.

If you have loaded an image-to-text model such as Llava into Takeoff, you can run inference with it using the /image_generate and /image_generate_stream endpoints. These endpoints are identical to their generate counterparts, except that they accept images, which are included in the input to the model.

How to include an image

To include an image, you must pass image data alongside the request. This is done using a multipart request.

See the examples below for how to use multipart requests to send image data with a request to Takeoff.

Alongside the image data (image_data) you need to send a json_data part, containing your prompt and any other parameters you want to include in the request. The specification for this part of the request matches the specification for the generate endpoint.

The model will "see" your image at the start of its text prompt, and will generate text based on the combination of the image and the rest of the prompt.
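
As a minimal sketch of the two parts described above, the helper below assembles the data and files arguments that the requests library accepts for a multipart upload. The helper names build_parts and send_image_request, the image.png filename, and the image/png content type are assumptions made for illustration; only the json_data and image_data part names and the text field come from this page.

```python
import json


def build_parts(prompt, image_bytes, **params):
    """Return (data, files) for a multipart request to an image endpoint.

    Hypothetical helper: only the part names json_data and image_data
    and the "text" field are taken from this page.
    """
    payload = {"text": prompt, **params}
    data = {"json_data": json.dumps(payload)}
    # (filename, content, content type); filename and type are assumed here.
    files = {"image_data": ("image.png", image_bytes, "image/png")}
    return data, files


def send_image_request(url, prompt, image_path, **params):
    """Post the prompt and image to url and return the response object."""
    import requests  # third-party; imported lazily so build_parts needs only the stdlib

    with open(image_path, "rb") as image_file:
        data, files = build_parts(prompt, image_file.read(), **params)
    return requests.post(url, data=data, files=files)
```

Calling send_image_request("http://localhost:3000/image_generate", "USER: Describe the image to me.\nASSISTANT:", "/path/to/image.png") would target the non-streaming endpoint. Any extra keyword arguments are passed through inside json_data, so they should follow the specification for the generate endpoint.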

Supported Models

Takeoff currently supports Llava models that have been converted to a Hugging Face-compatible format. See the llava-hf page for a list of supported image-to-text models.


Takeoff can be accessed through the REST API, the GUI, or our Python client. The example below calls the REST API from Python using the requests library.
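import json

import requests

# The prompt and any other generation parameters go in the json_data part.
json_data = {"text": "USER: Describe the image to me.\nASSISTANT:"}

url = "http://localhost:3000/image_generate_stream"

data = {"json_data": json.dumps(json_data)}

# Send the image as the image_data part of the multipart request.
with open("/path/to/image.png", "rb") as image_file:
    files = {"image_data": image_file}
    response = requests.post(url, data=data, files=files)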
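
Because the example above targets the streaming endpoint, the reply arrives incrementally. The sketch below shows one way to read it line by line with requests. It assumes the server sends newline-delimited UTF-8 text chunks, which this page does not specify, and the helper name iter_text_lines is hypothetical.

```python
def iter_text_lines(response):
    """Yield non-empty decoded lines from a streamed requests response.

    Assumption: chunks are newline-delimited UTF-8 text; the exact
    chunk format is not described on this page.
    """
    for raw_line in response.iter_lines():
        if raw_line:  # skip blank lines between chunks
            yield raw_line.decode("utf-8")
```

To use it, pass stream=True to requests.post so the body is not downloaded all at once, then loop with for line in iter_text_lines(response): print(line) to print each chunk as it arrives.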