# Backend Selection: LLM vs VLM

## Overview

Docling Graph supports two extraction backends: **LLM** (Language Model) for text-based extraction and **VLM** (Vision-Language Model) for vision-based extraction. Choosing the right backend is crucial for extraction quality and performance.

In this guide:

- LLM vs VLM comparison
- When to use each backend
- Performance characteristics
- Cost considerations
- Switching between backends

## Backend Comparison

### Quick Comparison Table

| Aspect | LLM Backend | VLM Backend |
|---|---|---|
| Input | Markdown text | Document images |
| Best For | Text-heavy documents | Complex layouts, images |
| Inference | Local or Remote | Local only |
| Speed | Fast | Slower |
| Accuracy | High for text | Highest for complex layouts |
| GPU Required | Optional (remote) | Yes (local only) |
| Cost | Low (local) to Medium (remote) | Medium (GPU required) |
| Setup | Easy | Moderate |

## LLM Backend

### What is the LLM Backend?

The LLM backend uses language models to extract structured data from markdown text. Documents are first converted to markdown using Docling, then processed by the LLM.
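
This first stage can be reproduced on its own with Docling. A minimal sketch, assuming the upstream `docling` package is installed (the snippet is illustrative and bypasses `docling_graph` entirely):

```python
from docling.document_converter import DocumentConverter

# Stage 1 of the LLM backend: convert the document to markdown.
converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown_text = result.document.export_to_markdown()

# Stage 2 (handled by docling_graph) feeds this markdown to the LLM,
# which fills your Pydantic template with structured data.
print(markdown_text[:500])
```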

### Architecture

```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
A@{ shape: terminal, label: "PDF Document" }
B@{ shape: procs, label: "Docling Conversion" }
C@{ shape: doc, label: "Markdown Text" }
D@{ shape: tag-proc, label: "Chunking Optional" }
E@{ shape: procs, label: "LLM Extraction" }
F@{ shape: doc, label: "Structured Data" }
%% 3. Define Connections
A --> B
B --> C
C --> D
D --> E
E --> F
%% 4. Apply Classes
class A input
class B,E process
class C data
class D operator
class F output
```

### Configuration

```python
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",      # LLM backend
    inference="local",  # or "remote"
)
```
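
Running the pipeline is then a single call, as in the examples later in this guide:

```python
run_pipeline(config)
```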

### When to Use LLM

✅ Use LLM when:

- Documents are primarily text-based
- Layout is standard (invoices, contracts, reports)
- You need remote API support
- Cost efficiency is important
- You want fast processing
- You don't have a GPU available (use remote)

❌ Don't use LLM when:

- Documents have complex visual layouts
- Images contain critical information
- Tables have complex structures
- Handwriting needs to be processed

### LLM Advantages

- **Flexible Inference**
    - Local: use your own GPU/CPU
    - Remote: use cloud APIs (OpenAI, Mistral, Gemini)
- **Fast Processing**
    - Quick markdown conversion
    - Efficient text processing
    - Parallel chunking support
- **Cost Effective**
    - Local inference: free (after GPU cost)
    - Remote inference: pay per token
    - Generally cheaper than VLM
- **Easy Setup**
    - No GPU required for remote
    - Simple API key configuration
    - Wide model selection
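
Remote setup is typically just an API key plus two config fields. A minimal sketch; the environment variable name is an assumption that depends on your provider, and the model id is taken from the switching example later in this guide:

```python
import os

from docling_graph import run_pipeline, PipelineConfig

# Assumption: remote inference reads the provider key from the environment.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
    model_override="gpt-4-turbo",  # specify the remote model
)
run_pipeline(config)
```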

### LLM Limitations

- **Text-Only Processing**
    - Loses visual information
    - May miss layout cues
    - Can't process images directly
- **OCR Dependency**
    - Relies on Docling OCR quality
    - May struggle with poor scans
    - Handwriting not well supported
- **Context Limits**
    - Large documents need chunking (see the sketch below)
    - May lose cross-page context
    - Requires consolidation for coherence
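
Chunking and consolidation are both controlled through the pipeline config; the same options are detailed under LLM-Specific Settings below:

```python
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="large_document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    use_chunking=True,       # split the large document into chunks
    llm_consolidation=True,  # merge per-chunk results with the LLM
)
```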

## VLM Backend

### What is the VLM Backend?

The VLM backend uses vision-language models to extract structured data directly from document images. It processes visual information alongside text, understanding layout and structure.

### Architecture

```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
InputPDF@{ shape: terminal, label: "PDF Document" }
InputImg@{ shape: terminal, label: "Images" }
Convert@{ shape: procs, label: "PDF to Image<br>Conversion" }
PageImgs@{ shape: doc, label: "Page Images" }
VLM@{ shape: procs, label: "VLM Processing" }
Understand@{ shape: lin-proc, label: "Visual Understanding" }
Extract@{ shape: tag-proc, label: "Direct Extraction" }
Output@{ shape: doc, label: "Pydantic Models" }
%% 3. Define Connections
%% Path A: PDF requires conversion
InputPDF --> Convert
Convert --> PageImgs
PageImgs --> VLM
%% Path B: Direct Image Input (Merges here)
InputImg --> VLM
%% Shared Processing Chain
VLM --> Understand
Understand --> Extract
Extract --> Output
%% 4. Apply Classes
class InputPDF,InputImg input
class Convert,VLM,Understand process
class PageImgs data
class Extract operator
class Output output
```

### Configuration

```python
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",            # VLM backend
    inference="local",        # VLM only supports local
    docling_config="vision",  # Optional: use vision pipeline
)
```
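
Because the VLM backend only runs locally, it is worth confirming a GPU is visible before launching a long job. A minimal check, assuming PyTorch is installed in the same environment:

```python
import torch

# Fail fast if no CUDA device is available for local VLM inference.
if not torch.cuda.is_available():
    raise RuntimeError("The VLM backend requires a local GPU")

run_pipeline(config)
```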

### When to Use VLM

✅ Use VLM when:

- Documents have complex visual layouts
- Images contain critical information
- Tables have intricate structures
- Forms have specific visual patterns
- Highest accuracy is required
- You have a GPU available

❌ Don't use VLM when:

- Documents are simple text
- You need remote API support
- A GPU is not available
- Processing speed is critical
- Cost is a major concern

### VLM Advantages

- **Visual Understanding**
    - Processes layout and structure
    - Understands visual relationships
    - Handles complex tables
    - Processes embedded images
- **Higher Accuracy**
    - Best for complex documents
    - Understands visual context
    - Fewer extraction errors
    - Better table handling
- **No OCR Dependency**
    - Direct image processing
    - Better with poor scans
    - Handles handwriting better
    - Preserves visual information

### VLM Limitations

- **Local Only**
    - Requires a local GPU
    - No remote API support
    - Higher setup complexity
    - GPU memory requirements
- **Slower Processing**
    - Image processing overhead
    - Larger model size
    - More GPU memory needed
    - Longer inference time
- **Higher Cost**
    - GPU required
    - More expensive hardware
    - Higher power consumption
    - Larger storage needs

## Decision Matrix

### By Document Type

| Document Type | Recommended Backend | Reason |
|---|---|---|
| Invoices | LLM | Standard layout, text-heavy |
| Contracts | LLM | Text-heavy, standard format |
| Research Papers | LLM | Text-heavy, standard layout |
| Forms | VLM | Visual structure important |
| ID Cards | VLM | Visual layout critical |
| Complex Tables | VLM | Visual structure needed |
| Handwritten | VLM | Visual processing required |
| Mixed Content | VLM | Images and text combined |

### By Infrastructure

| Infrastructure | Recommended Backend | Configuration |
|---|---|---|
| No GPU | LLM Remote | `backend="llm", inference="remote"` |
| CPU Only | LLM Remote | `backend="llm", inference="remote"` |
| GPU Available | LLM or VLM Local | `backend="llm"` or `backend="vlm"`, with `inference="local"` |
| Cloud/API | LLM Remote | `backend="llm", inference="remote"` |

### By Priority

| Priority | Recommended Backend | Reason |
|---|---|---|
| Speed | LLM | Faster processing |
| Accuracy | VLM | Better visual understanding |
| Cost | LLM Local | No API costs |
| Simplicity | LLM Remote | Easy setup |
| Offline | LLM or VLM Local | No internet needed |
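
The three matrices translate directly into a small selection helper. A hypothetical sketch; the helper and its arguments are illustrative, not part of the library:

```python
from docling_graph import PipelineConfig

def choose_config(source: str, template: str,
                  has_gpu: bool, complex_layout: bool) -> PipelineConfig:
    """Illustrative helper applying the decision matrices above."""
    if complex_layout and has_gpu:
        # Complex visual layouts with a GPU available: VLM local.
        return PipelineConfig(source=source, template=template,
                              backend="vlm", inference="local")
    # Text-heavy documents, or no GPU: LLM, remote when no GPU is present.
    return PipelineConfig(source=source, template=template,
                          backend="llm",
                          inference="local" if has_gpu else "remote")
```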

## Performance Comparison

### Processing Speed

```text
Document: 10-page invoice PDF

LLM Local (GPU):   ~30 seconds
LLM Remote (API):  ~45 seconds
VLM Local (GPU):   ~90 seconds
```

### Accuracy Comparison

```text
Document Type: Complex invoice with tables

LLM Accuracy: 92% field extraction
VLM Accuracy: 97% field extraction

Document Type: Simple text contract

LLM Accuracy: 98% field extraction
VLM Accuracy: 96% field extraction
```

### Cost Comparison

```text
Processing 1000 documents:

LLM Local:  $0 (GPU amortized)
LLM Remote: $50-200 (API costs)
VLM Local:  $0 (GPU amortized)
VLM Remote: Not available
```

## Switching Between Backends

### From LLM to VLM

```python
from docling_graph import PipelineConfig

# Original LLM config
config_llm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
)

# Switch to VLM
config_vlm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",            # Change backend
    inference="local",        # Must be local for VLM
    docling_config="vision",  # Optional: use vision pipeline
)
```

### From VLM to LLM

```python
from docling_graph import PipelineConfig

# Original VLM config
config_vlm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",
    inference="local",
)

# Switch to LLM
config_llm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",                 # Change backend
    inference="remote",            # Can now use remote
    model_override="gpt-4-turbo",  # Specify model
)
```

## Hybrid Approach

### Strategy 1: Document Type Based

```python
from docling_graph import PipelineConfig

def get_config(document_path: str, document_type: str) -> PipelineConfig:
    """Choose a backend based on document type."""
    if document_type in ["invoice", "contract", "report"]:
        # Use LLM for text-heavy documents
        return PipelineConfig(
            source=document_path,
            template="templates.BillingDocument",
            backend="llm",
            inference="remote",
        )
    else:
        # Use VLM for complex layouts
        return PipelineConfig(
            source=document_path,
            template="templates.Form",
            backend="vlm",
            inference="local",
        )
```

### Strategy 2: Fallback Pattern

```python
from docling_graph import run_pipeline, PipelineConfig
# Assumption: ExtractionError is the pipeline's extraction failure type;
# adjust this import to wherever your docling_graph version exposes it.
from docling_graph import ExtractionError

def extract_with_fallback(document_path: str) -> None:
    """Try LLM first, fall back to VLM if needed."""
    try:
        # Try LLM first (faster)
        config = PipelineConfig(
            source=document_path,
            template="templates.BillingDocument",
            backend="llm",
            inference="remote",
        )
        run_pipeline(config)
    except ExtractionError:
        # Fall back to VLM for better accuracy
        config = PipelineConfig(
            source=document_path,
            template="templates.BillingDocument",
            backend="vlm",
            inference="local",
        )
        run_pipeline(config)
```
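
This pattern trades an occasional second pass for lower average cost: most documents succeed on the fast LLM path, and only failures pay the GPU-bound VLM price. Usage is a single call:

```python
extract_with_fallback("document.pdf")
```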

## Backend-Specific Settings

### LLM-Specific Settings

```python
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    # LLM-specific
    use_chunking=True,       # Split large documents
    llm_consolidation=True,  # Merge results with LLM
    max_batch_size=5,        # Process multiple chunks per batch
)
```

### VLM-Specific Settings

```python
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",
    # VLM-specific
    docling_config="vision",       # Use vision pipeline
    processing_mode="one-to-one",  # Process pages individually
)
```

## Common Questions

**Q: Can I use VLM with remote inference?**

A: No, VLM currently only supports local inference. Use the LLM backend for remote API support.

**Q: Which backend is more accurate?**

A: VLM is generally more accurate for complex layouts and visual documents. LLM is more accurate for simple text documents.

**Q: Which backend is faster?**

A: LLM is faster, especially with remote APIs. VLM requires more processing time due to image analysis.

**Q: Can I switch backends mid-project?**

A: Yes, backends are interchangeable. Just change the `backend` parameter in your config.

**Q: Do I need different templates for different backends?**

A: No, the same Pydantic template works with both backends.
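
Because one template serves both backends, switching costs nothing on the schema side. A sketch of what such a template might look like; the class and field names are illustrative, not the project's actual `templates.BillingDocument`:

```python
from pydantic import BaseModel

class BillingDocument(BaseModel):
    """Illustrative extraction template shared by both backends."""
    invoice_number: str
    total_amount: float
    currency: str
```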

## Next Steps

Now that you understand backend selection:

- Model Configuration - Configure models for your chosen backend
- Processing Modes - Choose processing strategy
- Configuration Examples - See complete scenarios