Behind the Stack, Ep 5: Making RAG Work for Multimodal Documents

· 5 min read
Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Introduction

Most retrieval-augmented generation (RAG) systems assume that documents are clean, structured, and text-based. But in enterprise environments, the reality is different. Documents often contain:

  • Tables with nested headers, merged cells, or embedded footnotes
  • Charts and images that convey critical insights
  • Layout-heavy formats like invoices, reports, or scanned documents

When such content passes through standard RAG pipelines, the results are often poor - irrelevant retrieval and hallucinated outputs during generation.

This post explores practical strategies to enable accurate retrieval and grounded generation from messy, multimodal documents. We focus on two key stages:

  1. Retrieval – How to index and surface relevant content that isn’t just plain text
  2. Generation – How to present structured or visual content to an LLM for high-quality answers

We'll cover proven architectures, model recommendations, and implementation details used in real-world production systems.

Why Tables and Images Break Traditional RAG

A typical RAG pipeline looks like this (a minimal sketch in code follows the list):

  1. Split the document into text chunks (paragraphs, sections)
  2. Serialize as plaintext or Markdown
  3. Embed with a text model (e.g., text-embedding-3, BGE, Cohere)
  4. Store embeddings in a vector database (e.g., FAISS, Qdrant)
  5. Retrieve top-k relevant chunks
  6. Inject them into an LLM for answer generation
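
To make the baseline concrete, here is a minimal sketch of that text-only pipeline, assuming sentence-transformers with a BGE checkpoint and an in-memory FAISS index; the chunks, model name, and prompt are illustrative.

```python
# Minimal text-only RAG: chunk -> embed -> index -> retrieve -> prompt.
import faiss
from sentence_transformers import SentenceTransformer

# Illustrative chunks; in practice these come from your document splitter.
chunks = [
    "Q1 revenue grew 11% year over year.",
    "APAC headcount increased by 40 employees.",
]

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
embeddings = embedder.encode(chunks, normalize_embeddings=True)

# Inner product over normalized vectors == cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = "How did revenue change year over year?"
query_vec = embedder.encode([query], normalize_embeddings=True)
_, ids = index.search(query_vec, 2)

# Inject the top-k chunks into the LLM prompt for answer generation.
context = "\n".join(chunks[i] for i in ids[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Everything tables and images contribute is already gone by the time `chunks` is built - which is exactly the failure mode described next.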

This pipeline breaks down when documents include:

  • Tables: Markdown flattens structure. Semantics like merged headers or totals are lost.
  • Images: Often dropped or processed via low-quality OCR, losing charts and diagrams.
  • Scanned Layouts: Visual hierarchy and multi-column formats disappear during text flattening.

The result: Important context is invisible to retrieval and misrepresented during generation.

Strategy 1: Full-Document Multimodal Embedding

Skip text serialization entirely. Instead, render the entire document as images and embed using a multimodal model that understands layout, tables, and visual context.

Pipeline:

  1. Render each document page to an image (e.g., with Poppler or PDFPlumber)
  2. Pass each page image to a multimodal model:
    • Open source: CoLLaMA, OpenCLIP, SigLIP, BLIP-2
    • Commercial: Jina, XDoc, LayoutXLM (token-based)
  3. Store embeddings in your vector index (a minimal sketch follows this list)
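
Here is a hedged sketch of this pipeline, assuming pdf2image (a Poppler wrapper) for rendering and a SigLIP checkpoint via Hugging Face transformers; the model name and in-memory scoring are illustrative - a production system would likely use a document-tuned multimodal embedder and a real vector index.

```python
# Render PDF pages to images and embed them with an image+text model.
import torch
from pdf2image import convert_from_path  # wraps Poppler's pdftoppm
from transformers import AutoProcessor, SiglipModel

model_name = "google/siglip-base-patch16-224"  # illustrative checkpoint
model = SiglipModel.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)

pages = convert_from_path("report.pdf", dpi=200)  # one PIL image per page

with torch.no_grad():
    image_inputs = processor(images=pages, return_tensors="pt")
    page_embeddings = model.get_image_features(**image_inputs)
    page_embeddings = torch.nn.functional.normalize(page_embeddings, dim=-1)

# At query time, embed the question into the same space and rank pages by cosine similarity.
query = "What was APAC revenue growth in Q1?"
with torch.no_grad():
    text_inputs = processor(text=[query], padding="max_length", return_tensors="pt")
    query_embedding = torch.nn.functional.normalize(
        model.get_text_features(**text_inputs), dim=-1
    )

scores = (page_embeddings @ query_embedding.T).squeeze(-1)
best_page = int(torch.argmax(scores))  # most relevant page, layout and all
```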

Why It Works:

  • Multimodal models trained on image+text pairs preserve layout, structure, and non-textual info
  • No need for lossy flattening or serialization heuristics

Caveats:

  • Fewer production-grade models are available
  • Retrofitting may require reindexing
  • Higher compute and memory costs

Strategy 2: Visual Summarization for Text-Based Stacks

If you’re sticking with text-only embeddings, you can still integrate visual content using descriptive captions.

Pipeline:

  1. Segment the document into layout blocks using:
    • AWS Textract
    • Azure Document Intelligence
    • LayoutParser + Tesseract (open source)
  2. For each table or image block, run OCR and generate a text summary using a multimodal LLM (see the sketch after this list):
    • Hosted: GPT-4 Vision, Claude 3 Opus, Gemini 1.5 Pro
    • Open source: LLaVA 1.5, MiniGPT-4, Kosmos-2
  3. Prompt example: "Summarize this table, including headers, key values, and trends."
  4. Append summaries to surrounding text and embed the result using your existing model (e.g., text-embedding-3, bge-large-en)
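
Here is a hedged sketch of steps 2-4, using the OpenAI Python client against a vision-capable model as one example; the model name, crop path, and surrounding text are illustrative, and any of the hosted or open-source models listed above could play the same role.

```python
# Caption a cropped table/figure image, then fold the summary back into the text chunk.
import base64
from openai import OpenAI

client = OpenAI()

# A layout block cropped out by Textract / Document Intelligence / LayoutParser.
with open("table_crop.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; any vision-capable chat model works here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarize this table, including headers, key values, and trends."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
summary = response.choices[0].message.content

# Append the summary to the surrounding text, then embed with your existing text model.
surrounding_text = "Q1 results by region are shown in the table below."  # illustrative
chunk_text = f"{surrounding_text}\n\nTable Summary: {summary}"
```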

Example Chunk: “Table Summary: Q1 revenue by region shows APAC growth of 23%, EMEA flat, and a YoY increase in total revenue of 11%. See Figure 4.”

This lets visual information participate in search relevance scoring - without changing your pipeline.

Pros:

  • Easy integration
  • Compatible with any retriever or vector DB
  • Enables semantic search over image/table content

Cons:

  • Summaries may lose granularity
  • Requires preprocessing for all visual elements

Generation Over Structured and Visual Content

Once you retrieve relevant chunks, you need models that can reason over visual and structured inputs.

Text-only LLMs often fail due to:

  • Flattened or malformed formatting (especially tables)
  • Missing image content
  • Lost spatial or layout context

Solution:

Use vision-capable LLMs during generation. These models accept image inputs and can reason over full document renders.

Supported Models:

  • GPT-4 Turbo (Vision)
  • Claude 3.5 Sonnet
  • Gemini 1.5 Pro
  • LLaMA 3.2 Vision (11B/90B, self-hosted)
  • Yi-VL, Qwen-VL

Benefits:

  • No need for text serialization
  • Direct reasoning over charts, tables, and layout
  • Accurate answers to layout-grounded queries like “What does this figure show?”

Integration Tip: Pass full document context (e.g., PDF render or HTML snapshot) into your generation model. Cross-modal attention models (like LLaMA 3.2 Vision) handle this better than concatenation-based ones.
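
For the self-hosted route, here is a sketch of grounded generation over a retrieved page render, following the published transformers usage pattern for LLaMA 3.2 Vision; the model ID, image path, and question are illustrative, and hosted vision APIs accept the same image-plus-question structure.

```python
# Answer a question directly over a retrieved page image with a vision-capable LLM.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # illustrative checkpoint
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

page_image = Image.open("retrieved_page.png")  # the page surfaced by retrieval

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What does the figure on this page show?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    images=page_image, text=prompt, add_special_tokens=False, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```

Because the model attends over the rendered page itself, there is no serialization step to lose merged headers, chart content, or multi-column layout.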

Toolchain Summary

  • Document Segmentation - AWS Textract, Azure Document Intelligence, LayoutParser
  • OCR - Tesseract, EasyOCR
  • Multimodal Summarization - GPT-4V, Claude 3, Gemini, LLaVA, MiniGPT-4
  • Text Embedding - text-embedding-3, bge-large-en, Cohere Embed
  • Multimodal Embedding - CoLLaMA, SigLIP, XDoc, OpenCLIP
  • Vision Generation - GPT-4V, Claude 3.5, LLaMA 3.2 Vision, Yi-VL

Conclusion

Traditional RAG fails when applied to structured and visual content. Tables, figures, and layout matter, especially in high-stakes domains like finance, law, and enterprise analytics.

To close the gap:

  • Use multimodal embedding for full-fidelity search
  • Or adopt visual summarization to extend existing pipelines
  • Use vision-capable models at generation time for grounded, reliable outputs

Multimodal RAG isn’t theoretical - it’s operational. With the right tools and strategies, you can bring structure, charts, and layout into your RAG workflows - and generate responses grounded in the full reality of enterprise documents.