Version: Next

Image To Text Generation


Takeoff supports image-to-text generation using multimodal generative models like Llava. Models like Llava are trained to describe and interact with images provided to them, and can be used for question answering over images, OCR, and image captioning.

If you have loaded an image-to-text model like Llava into Takeoff, you can run inference with it using the /image_generate and /image_generate_stream endpoints. These endpoints are identical to their generate counterparts, except that they also accept images, which are included in the input to the model.

Supported Image to Text Models


Takeoff supports Llava models based on the original repo. You can find collections of the v1.6 and v1.5 models on Hugging Face. Any model with the architecture LlavaLlamaForCausalLM or LlavaMistralForCausalLM is supported. You can find a model's architecture in the config.json file in its Hugging Face repo.
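
As a quick way to apply the check described above, the following sketch tests whether a parsed config.json lists one of the two supported architectures. It assumes the file follows the usual Hugging Face layout, with an `architectures` list at the top level. The helper name `is_supported_llava` and the example config are hypothetical and not part of Takeoff.

```python
import json

# Architectures named in the text above as supported by Takeoff.
SUPPORTED_ARCHITECTURES = {"LlavaLlamaForCausalLM", "LlavaMistralForCausalLM"}


def is_supported_llava(config: dict) -> bool:
    """Return True if any architecture listed in a config.json dict is supported."""
    architectures = config.get("architectures") or []
    return any(arch in SUPPORTED_ARCHITECTURES for arch in architectures)


# A trimmed-down, made-up config.json for illustration only.
example_config = json.loads(
    '{"architectures": ["LlavaLlamaForCausalLM"], "model_type": "llava"}'
)
print(is_supported_llava(example_config))
```

To check a real model, download its config.json and pass the result of `json.load` to the helper.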

How to include an image


To include an image, you must pass image data alongside the request. There are two ways to do this: multipart requests and inline images.

Multipart requests


Multipart requests are a way to send multiple types of data in a single HTTP request. The most common use case for multipart requests on the web is form data. Many web forms contain only text data, which can be encoded as a set of keys and values in the application/x-www-form-urlencoded format. Sometimes, though, a form needs to contain media data: images, videos, audio, and so on. To send this data alongside the other information in the form, you can use a multipart request.

We use multipart requests in the Takeoff API to send image data alongside a text prompt. The model receives the image and the text, and then generates text based on the combination of the two. For example, you might prompt the model to describe the image, or to answer a question about the image.

See below for examples of how to use multipart requests to send image data with a request to Takeoff.

Alongside the image data (image_data) you need to send a json_data part, containing your prompt and any other parameters you want to include in the request. The specification for this part of the request matches the specification for the generate endpoint.

The model will "see" your image at the start of its text prompt, and will generate text based on the combination of the image and the rest of the prompt.

Multipart Examples


Our Takeoff client library supports transparently sending local image data with your requests. Multipart requests can also be sent from most common HTTP clients.

generation_parameters.py
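import json

import requests

# Prompt and any other generation parameters, as for the generate endpoint.
json_data = {"text": "USER: Describe the image to me.\nASSISTANT:"}

# Use the non-streaming endpoint, since the response is read as a single JSON body.
url = "http://localhost:3000/image_generate"

# The JSON parameters travel as a form field named json_data.
data = {"json_data": json.dumps(json_data)}

# Open the image in binary mode; the context manager closes it after the request.
with open("/path/to/image.png", "rb") as image_file:
    files = {"image_data": image_file}
    response = requests.post(url, data=data, files=files)

response.raise_for_status()
print(response.json())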
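
The /image_generate_stream endpoint takes the same image_data and json_data parts, but the response arrives incrementally and cannot be read as a single JSON body. The sketch below shows one way to consume it. It assumes that the stream is delivered as server-sent events, with each payload on a line starting with `data:`, so check the format your deployment actually returns. The helpers `iter_sse_data` and `stream_image_description` are hypothetical names, not part of the Takeoff client.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line in a server-sent event stream."""
    for line in lines:
        if line.startswith("data:"):
            yield line[len("data:"):].strip()


def stream_image_description(url: str, image_path: str, prompt: str) -> Iterator[str]:
    """POST an image and a prompt as multipart data, yielding streamed payloads."""
    # Third-party dependency, imported here so the parser above needs only the stdlib.
    import requests

    data = {"json_data": json.dumps({"text": prompt})}
    with open(image_path, "rb") as image_file:
        files = {"image_data": image_file}
        with requests.post(url, data=data, files=files, stream=True) as response:
            response.raise_for_status()
            # iter_lines() yields raw bytes; decode explicitly as UTF-8.
            decoded = (raw.decode("utf-8") for raw in response.iter_lines())
            yield from iter_sse_data(decoded)


# Example usage (requires a running Takeoff server and a real image path):
# for chunk in stream_image_description(
#     "http://localhost:3000/image_generate_stream",
#     "/path/to/image.png",
#     "USER: Describe the image to me.\nASSISTANT:",
# ):
#     print(chunk)
```

Keeping the line parsing separate from the HTTP call makes it straightforward to adapt the sketch if the stream uses a different framing.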

Inline images


Takeoff also supports supplying remote images inline with your prompt. This is useful if your image data is stored on a remote server, or if you cannot send multipart requests.

To supply an inline image for the Takeoff server to process, first set the TAKEOFF_ALLOW_REMOTE_IMAGES=true environment variable, and then include an image tag with your request. An image tag has the following format:

<image:https://example.com/path/to/image.png>

For example, you might send a curl request as follows:

curl -X POST "http://localhost:3000/generate" \
-H "accept: application/json" -H "Content-Type: application/json" \
-d '{"text":"<image:https://picsum.photos/id/237/200/300> \n Describe this image to me."}'
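
The curl request above can also be expressed in Python. The following sketch builds the same JSON body using only the standard library. The helper names `with_inline_image` and `build_generate_request` are hypothetical, and the request is constructed but not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def with_inline_image(image_url: str, prompt: str) -> str:
    """Prefix a prompt with an image tag in the <image:URL> format."""
    return f"<image:{image_url}> \n {prompt}"


def build_generate_request(base_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request to the generate endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


text = with_inline_image(
    "https://picsum.photos/id/237/200/300", "Describe this image to me."
)
request = build_generate_request("http://localhost:3000", text)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```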

Danger

The inline image is downloaded and processed on the server side. While Takeoff takes sensible precautions to protect itself from attack, you should only use Takeoff in this configuration with image sources that you trust. Failing to do so can leave you exposed to Server-Side Request Forgery (SSRF) attacks. The inline images feature is disabled by default, and can be enabled by setting the allow_remote_images=true flag in the Takeoff config (or the TAKEOFF_ALLOW_REMOTE_IMAGES=true environment variable).
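
One way to restrict inline images to trusted sources is to validate URLs in your own application before they are placed in an image tag. The sketch below is an illustrative client-side check, not a Takeoff feature. The allow-list contents and the helper name `is_trusted_image_url` are assumptions for the example, and the check complements the server's own precautions without replacing them.

```python
from urllib.parse import urlparse

# Hypothetical allow-list; replace with the hosts you actually trust.
TRUSTED_IMAGE_HOSTS = {"picsum.photos", "example.com"}


def is_trusted_image_url(url: str) -> bool:
    """Accept only https URLs whose host is exactly on the allow-list."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in TRUSTED_IMAGE_HOSTS


candidate = "https://picsum.photos/id/237/200/300"
if is_trusted_image_url(candidate):
    prompt = f"<image:{candidate}> \n Describe this image to me."
```

Matching the hostname exactly, rather than by prefix or substring, avoids accepting look-alike hosts such as a trusted name used as a subdomain of another domain.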