Behind the Stack, Ep 5: Making RAG Work for Multimodal Documents

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

Most retrieval-augmented generation (RAG) systems assume that documents are clean, structured, and text-based. But in enterprise environments, the reality is different. Documents often contain:

  • Tables with nested headers, merged cells, or embedded footnotes
  • Charts and images that convey critical insights
  • Layout-heavy formats like invoices, reports, or scanned documents

When such content passes through standard RAG pipelines, the results are often poor - irrelevant retrieval and hallucinated outputs during generation.

This post explores practical strategies to enable accurate retrieval and grounded generation from messy, multimodal documents. We focus on two key stages:

  1. Retrieval – How to index and surface relevant content that isn’t just plain text
  2. Generation – How to present structured or visual content to an LLM for high-quality answers

We'll cover proven architectures, model recommendations, and implementation details used in real-world production systems.

Why Tables and Images Break Traditional RAG

A typical RAG pipeline looks like this (a minimal sketch in code follows the list):

  1. Split the document into text chunks (paragraphs, sections)
  2. Serialize as plaintext or Markdown
  3. Embed with a text model (e.g., text-embedding-3, BGE, Cohere)
  4. Store embeddings in a vector database (e.g., FAISS, Qdrant)
  5. Retrieve top-k relevant chunks
  6. Inject them into an LLM for answer generation
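
To make the baseline concrete, here is a minimal sketch of that text-only pipeline, assuming sentence-transformers with a BGE checkpoint and an in-memory FAISS index; the chunks, model name, and prompt are illustrative.

```python
# Minimal text-only RAG: chunk -> embed -> index -> retrieve -> prompt.
import faiss
from sentence_transformers import SentenceTransformer

# Illustrative chunks; in practice these come from your document splitter.
chunks = [
    "Q1 revenue grew 11% year over year.",
    "APAC headcount increased by 40 employees.",
]

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
embeddings = embedder.encode(chunks, normalize_embeddings=True)

# Inner product over normalized vectors == cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = "How did revenue change year over year?"
query_vec = embedder.encode([query], normalize_embeddings=True)
_, ids = index.search(query_vec, 2)

# Inject the top-k chunks into the LLM prompt for answer generation.
context = "\n".join(chunks[i] for i in ids[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Everything tables and images contribute is already gone by the time `chunks` is built - which is exactly the failure mode described next.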

This pipeline breaks down when documents include:

  • Tables: Markdown flattens structure. Semantics like merged headers or totals are lost.
  • Images: Often dropped or processed via low-quality OCR, losing charts and diagrams.
  • Scanned Layouts: Visual hierarchy and multi-column formats disappear during text flattening.

The result: Important context is invisible to retrieval and misrepresented during generation.

Strategy 1: Full-Document Multimodal Embedding

Skip text serialization entirely. Instead, render the entire document as images and embed using a multimodal model that understands layout, tables, and visual context.

Pipeline:

  1. Render each document page to an image (e.g., with Poppler or PDFPlumber)
  2. Pass each page image to a multimodal model:
    • Open source: CoLLaMA, OpenCLIP, SigLIP, BLIP-2
    • Commercial: Jina, XDoc, LayoutXLM (token-based)
  3. Store embeddings in your vector index (a minimal sketch follows this list)
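
Here is a hedged sketch of this pipeline, assuming pdf2image (a Poppler wrapper) for rendering and a SigLIP checkpoint via Hugging Face transformers; the model name and in-memory scoring are illustrative - a production system would likely use a document-tuned multimodal embedder and a real vector index.

```python
# Render PDF pages to images and embed them with an image+text model.
import torch
from pdf2image import convert_from_path  # wraps Poppler's pdftoppm
from transformers import AutoProcessor, SiglipModel

model_name = "google/siglip-base-patch16-224"  # illustrative checkpoint
model = SiglipModel.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

pages = convert_from_path("report.pdf", dpi=200)  # one PIL image per page

with torch.no_grad():
    image_inputs = processor(images=pages, return_tensors="pt")
    page_embeddings = model.get_image_features(**image_inputs)
    page_embeddings = torch.nn.functional.normalize(page_embeddings, dim=-1)

# At query time, embed the question into the same space and rank pages by cosine similarity.
query = "What was APAC revenue growth in Q1?"
with torch.no_grad():
    text_inputs = processor(text=[query], padding="max_length", return_tensors="pt")
    query_embedding = torch.nn.functional.normalize(
        model.get_text_features(**text_inputs), dim=-1
    )

scores = (page_embeddings @ query_embedding.T).squeeze(-1)
best_page = int(torch.argmax(scores))  # most relevant page, layout and all
```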

Why It Works:

  • Multimodal models trained on image+text pairs preserve layout, structure, and non-textual info
  • No need for lossy flattening or serialization heuristics

Caveats:

  • Fewer production-grade models are available
  • Retrofitting may require reindexing
  • Higher compute and memory costs

Strategy 2: Visual Summarization for Text-Based Stacks

If you’re sticking with text-only embeddings, you can still integrate visual content using descriptive captions.

Pipeline:

  1. Segment the document into layout blocks using:
    • AWS Textract
    • Azure Document Intelligence
    • LayoutParser + Tesseract (open source)
  2. For each table or image block, run OCR and generate a text summary using a multimodal LLM (see the sketch after this list):
    • Hosted: GPT-4 Vision, Claude 3 Opus, Gemini 1.5 Pro
    • Open source: LLaVA 1.5, MiniGPT-4, Kosmos-2
  3. Prompt example: "Summarize this table, including headers, key values, and trends."
  4. Append summaries to surrounding text and embed the result using your existing model (e.g., text-embedding-3, bge-large-en)
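
Here is a hedged sketch of steps 2-4, using the OpenAI Python client against a vision-capable model as one example; the model name, crop path, and surrounding text are illustrative, and any of the hosted or open-source models listed above could play the same role.

```python
# Caption a cropped table/figure image, then fold the summary back into the text chunk.
import base64
from openai import OpenAI

client = OpenAI()

# A layout block cropped out by Textract / Document Intelligence / LayoutParser.
with open("table_crop.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; any vision-capable chat model works here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarize this table, including headers, key values, and trends."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
summary = response.choices[0].message.content

# Append the summary to the surrounding text, then embed with your existing text model.
surrounding_text = "Q1 results by region are shown in the table below."  # illustrative
chunk_text = f"{surrounding_text}\n\nTable Summary: {summary}"
```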

Example Chunk: “Table Summary: Q1 revenue by region shows APAC growth of 23%, EMEA flat, and a YoY increase in total revenue of 11%. See Figure 4.”

This lets visual information participate in search relevance scoring - without changing your pipeline.

Pros:

  • Easy integration
  • Compatible with any retriever or vector DB
  • Enables semantic search over image/table content

Cons:

  • Summaries may lose granularity
  • Requires preprocessing for all visual elements

Generation Over Structured and Visual Content

Once you retrieve relevant chunks, you need models that can reason over visual and structured inputs.

Text-only LLMs often fail due to:

  • Flattened or malformed formatting (especially tables)
  • Missing image content
  • Lost spatial or layout context

Solution:

Use vision-capable LLMs during generation. These models accept image inputs and can reason over full document renders.

Supported Models:

  • GPT-4 Turbo (Vision)
  • Claude 3.5 Sonnet
  • Gemini 1.5 Pro
  • LLaMA 3.2 Vision (11B/90B, self-hosted)
  • Yi-VL, Qwen-VL

Benefits:

  • No need for text serialization
  • Direct reasoning over charts, tables, and layout
  • Accurate answers to layout-grounded queries like “What does this figure show?”

Integration Tip: Pass full document context (e.g., PDF render or HTML snapshot) into your generation model. Cross-modal attention models (like LLaMA 3.2 Vision) handle this better than concatenation-based ones.
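
For the self-hosted route, here is a sketch of grounded generation over a retrieved page render, following the published transformers usage pattern for LLaMA 3.2 Vision; the model ID, image path, and question are illustrative, and hosted vision APIs accept the same image-plus-question structure.

```python
# Answer a question directly over a retrieved page image with a vision-capable LLM.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # illustrative checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

page_image = Image.open("retrieved_page.png")  # the page surfaced by retrieval

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What does the figure on this page show?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    images=page_image, text=prompt, add_special_tokens=False, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```

Because the model attends over the rendered page itself, there is no serialization step to lose merged headers, chart content, or multi-column layout.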

Toolchain Summary

  • Document Segmentation - AWS Textract, Azure Document Intelligence, LayoutParser
  • OCR - Tesseract, EasyOCR
  • Multimodal Summarization - GPT-4V, Claude 3, Gemini, LLaVA, MiniGPT-4
  • Text Embedding - text-embedding-3, bge-large-en, Cohere Embed
  • Multimodal Embedding - CoLLaMA, SigLIP, XDoc, OpenCLIP
  • Vision Generation - GPT-4V, Claude 3.5, LLaMA 3.2 Vision, Yi-VL

Conclusion

Traditional RAG fails when applied to structured and visual content. Tables, figures, and layout matter, especially in high-stakes domains like finance, law, and enterprise analytics.

To close the gap:

  • Use multimodal embedding for full-fidelity search
  • Or adopt visual summarization to extend existing pipelines
  • Use vision-capable models at generation time for grounded, reliable outputs

Multimodal RAG isn’t theoretical - it’s operational. With the right tools and strategies, you can bring structure, charts, and layout into your RAG workflows - and generate responses grounded in the full reality of enterprise documents.