Backend Selection: LLM vs VLM

Overview

Docling Graph supports two extraction backends: LLM (Large Language Model) for text-based extraction and VLM (Vision-Language Model) for vision-based extraction. Choosing the right backend is crucial for extraction quality and performance.

In this guide:

- LLM vs VLM comparison
- When to use each backend
- Performance characteristics
- Cost considerations
- Switching between backends


Backend Comparison

Quick Comparison Table

| Aspect | LLM Backend | VLM Backend |
| --- | --- | --- |
| Input | Markdown text | Document images |
| Best For | Text-heavy documents | Complex layouts, images |
| Inference | Local or Remote | Local only |
| Speed | Fast | Slower |
| Accuracy | High for text | Highest for complex layouts |
| GPU Required | Optional (remote) | Yes (local only) |
| Cost | Low (local) to Medium (remote) | Medium (GPU required) |
| Setup | Easy | Moderate |

LLM Backend

What is LLM Backend?

The LLM backend uses language models to extract structured data from markdown text. Documents are first converted to markdown using Docling, then processed by the LLM.
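
The pipeline performs this conversion for you; as a rough standalone sketch, the equivalent step with Docling itself looks like this (illustrative only, not something you need to run yourself):

from docling.document_converter import DocumentConverter

# Convert the source document to markdown, as the LLM backend does
# internally before prompting the language model.
converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown_text = result.document.export_to_markdown()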

Architecture

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "PDF Document" }

    B@{ shape: procs, label: "Docling Conversion" }
    C@{ shape: doc, label: "Markdown Text" }
    D@{ shape: tag-proc, label: "Chunking (Optional)" }
    E@{ shape: procs, label: "LLM Extraction" }

    F@{ shape: doc, label: "Structured Data" }

    %% 3. Define Connections
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F

    %% 4. Apply Classes
    class A input
    class B,E process
    class C data
    class D operator
    class F output

Configuration

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",  # LLM backend
    inference="local"  # or "remote"
)
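
With the config in place, the extraction itself is a single call (run_pipeline is already imported above):

run_pipeline(config)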

When to Use LLM

Use LLM when:

- Documents are primarily text-based
- Layout is standard (invoices, contracts, reports)
- You need remote API support
- Cost efficiency is important
- You want fast processing
- You don't have a GPU available (use remote)

Don't use LLM when:

- Documents have complex visual layouts
- Images contain critical information
- Tables have complex structures
- Handwriting needs to be processed

LLM Advantages

  1. Flexible Inference
     - Local: Use your own GPU/CPU
     - Remote: Use cloud APIs (OpenAI, Mistral, Gemini)

  2. Fast Processing
     - Quick markdown conversion
     - Efficient text processing
     - Parallel chunking support

  3. Cost Effective
     - Local inference: Free (after GPU cost)
     - Remote inference: Pay per token
     - Generally cheaper than VLM

  4. Easy Setup
     - No GPU required for remote
     - Simple API key configuration
     - Wide model selection
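
For example, a fully remote setup needs no local GPU at all. A minimal sketch, assuming your provider API key is already configured in the environment; model_override (shown again later in this guide) names the remote model:

from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",            # no local GPU needed
    model_override="gpt-4-turbo"   # remote model, billed per token
)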

LLM Limitations

  1. Text-Only Processing
     - Loses visual information
     - May miss layout cues
     - Can't process images directly

  2. OCR Dependency
     - Relies on Docling OCR quality
     - May struggle with poor scans
     - Handwriting not well supported

  3. Context Limits
     - Large documents need chunking
     - May lose cross-page context
     - Requires consolidation for coherence

VLM Backend

What is VLM Backend?

The VLM backend uses vision-language models to extract structured data directly from document images. It processes visual information alongside text, understanding layout and structure.

Architecture

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    InputPDF@{ shape: terminal, label: "PDF Document" }
    InputImg@{ shape: terminal, label: "Images" }

    Convert@{ shape: procs, label: "PDF to Image<br>Conversion" }
    PageImgs@{ shape: doc, label: "Page Images" }

    VLM@{ shape: procs, label: "VLM Processing" }
    Understand@{ shape: lin-proc, label: "Visual Understanding" }
    Extract@{ shape: tag-proc, label: "Direct Extraction" }

    Output@{ shape: doc, label: "Pydantic Models" }

    %% 3. Define Connections
    %% Path A: PDF requires conversion
    InputPDF --> Convert
    Convert --> PageImgs
    PageImgs --> VLM

    %% Path B: Direct Image Input (Merges here)
    InputImg --> VLM

    %% Shared Processing Chain
    VLM --> Understand
    Understand --> Extract
    Extract --> Output

    %% 4. Apply Classes
    class InputPDF,InputImg input
    class Convert,VLM,Understand process
    class PageImgs data
    class Extract operator
    class Output output

Configuration

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",  # VLM backend
    inference="local",  # VLM only supports local
    docling_config="vision"  # Optional: use vision pipeline
)

When to Use VLM

Use VLM when:

- Documents have complex visual layouts
- Images contain critical information
- Tables have intricate structures
- Forms have specific visual patterns
- Highest accuracy is required
- You have a GPU available

Don't use VLM when:

- Documents are simple text
- You need remote API support
- GPU is not available
- Processing speed is critical
- Cost is a major concern

VLM Advantages

  1. Visual Understanding
     - Processes layout and structure
     - Understands visual relationships
     - Handles complex tables
     - Processes embedded images

  2. Higher Accuracy
     - Best for complex documents
     - Understands visual context
     - Fewer extraction errors
     - Better table handling

  3. No OCR Dependency
     - Direct image processing
     - Better with poor scans
     - Handles handwriting better
     - Preserves visual information

VLM Limitations

  1. Local Only
     - Requires local GPU
     - No remote API support
     - Higher setup complexity
     - GPU memory requirements

  2. Slower Processing
     - Image processing overhead
     - Larger model size
     - More GPU memory needed
     - Longer inference time

  3. Higher Cost
     - GPU required
     - More expensive hardware
     - Higher power consumption
     - Larger storage needs

Decision Matrix

By Document Type

| Document Type | Recommended Backend | Reason |
| --- | --- | --- |
| Invoices | LLM | Standard layout, text-heavy |
| Contracts | LLM | Text-heavy, standard format |
| Rheology research papers | LLM | Text-heavy, standard layout |
| Forms | VLM | Visual structure important |
| ID Cards | VLM | Visual layout critical |
| Complex Tables | VLM | Visual structure needed |
| Handwritten | VLM | Visual processing required |
| Mixed Content | VLM | Images and text combined |

By Infrastructure

| Infrastructure | Recommended Backend | Configuration |
| --- | --- | --- |
| No GPU | LLM Remote | backend="llm", inference="remote" |
| CPU Only | LLM Remote | backend="llm", inference="remote" |
| GPU Available | LLM or VLM Local | backend="llm" or "vlm", inference="local" |
| Cloud/API | LLM Remote | backend="llm", inference="remote" |
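
A minimal sketch of encoding this table in code, assuming PyTorch is available for the GPU check (torch is only used here for detection; adjust to your environment):

import torch

from docling_graph import PipelineConfig

def config_for_infrastructure(document_path: str) -> PipelineConfig:
    """Prefer local inference when a GPU is present, otherwise use the remote LLM backend."""
    if torch.cuda.is_available():
        # GPU available: either backend works locally; switch to "vlm" for complex layouts
        return PipelineConfig(
            source=document_path,
            template="templates.BillingDocument",
            backend="llm",
            inference="local"
        )
    # No GPU / CPU only / cloud: remote LLM is the supported option
    return PipelineConfig(
        source=document_path,
        template="templates.BillingDocument",
        backend="llm",
        inference="remote"
    )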

By Priority

| Priority | Recommended Backend | Reason |
| --- | --- | --- |
| Speed | LLM | Faster processing |
| Accuracy | VLM | Better visual understanding |
| Cost | LLM Local | No API costs |
| Simplicity | LLM Remote | Easy setup |
| Offline | LLM or VLM Local | No internet needed |

Performance Comparison

Processing Speed

Document: 10-page invoice PDF

LLM Local (GPU):     ~30 seconds
LLM Remote (API):    ~45 seconds
VLM Local (GPU):     ~90 seconds
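
These figures are indicative only; absolute times depend on hardware, model choice, and page count. A rough sketch for timing your own documents:

import time

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local"
)

start = time.perf_counter()
run_pipeline(config)
print(f"Elapsed: {time.perf_counter() - start:.1f} s")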

Accuracy Comparison

Document Type: Complex invoice with tables

LLM Accuracy:  92% field extraction
VLM Accuracy:  97% field extraction

Document Type: Simple text contract

LLM Accuracy:  98% field extraction
VLM Accuracy:  96% field extraction
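
Field extraction accuracy here means the share of expected fields whose values were extracted correctly. A minimal sketch of such a check, assuming you have a ground-truth dict and dump the extracted Pydantic model with model_dump() (this helper is not part of docling_graph):

def field_accuracy(extracted: dict, expected: dict) -> float:
    """Fraction of expected fields whose extracted value matches exactly."""
    matches = sum(1 for key, value in expected.items() if extracted.get(key) == value)
    return matches / len(expected)

# e.g. extracted = billing_document.model_dump() for a populated template instance
extracted = {"invoice_number": "INV-001", "total": 119.0}
expected = {"invoice_number": "INV-001", "total": 120.0}
print(f"{field_accuracy(extracted, expected):.0%} field extraction")  # 50%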

Cost Comparison

Processing 1000 documents:

LLM Local:     $0 (GPU amortized)
LLM Remote:    $50-200 (API costs)
VLM Local:     $0 (GPU amortized)
VLM Remote:    Not available

Switching Between Backends

From LLM to VLM

# Original LLM config
config_llm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote"
)

# Switch to VLM
config_vlm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",  # Change backend
    inference="local",  # Must be local for VLM
    docling_config="vision"  # Optional: use vision pipeline
)

From VLM to LLM

# Original VLM config
config_vlm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",
    inference="local"
)

# Switch to LLM
config_llm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",  # Change backend
    inference="remote",  # Can now use remote
    model_override="gpt-4-turbo"  # Specify model
)

Hybrid Approach

Strategy 1: Document Type Based

def get_config(document_path: str, document_type: str):
    """Choose backend based on document type."""
    if document_type in ["invoice", "contract", "report"]:
        # Use LLM for text-heavy documents
        return PipelineConfig(
            source=document_path,
            template="templates.BillingDocument",
            backend="llm",
            inference="remote"
        )
    else:
        # Use VLM for complex layouts
        return PipelineConfig(
            source=document_path,
            template="templates.Form",
            backend="vlm",
            inference="local"
        )
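
Usage then reads the same regardless of which branch is taken:

config = get_config("document.pdf", "invoice")
run_pipeline(config)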

Strategy 2: Fallback Pattern

def extract_with_fallback(document_path: str):
    """Try LLM first, fall back to VLM if extraction fails."""
    try:
        # Try LLM first (faster)
        config = PipelineConfig(
            source=document_path,
            template="templates.BillingDocument",
            backend="llm",
            inference="remote"
        )
        run_pipeline(config)
    except ExtractionError:  # adjust to the exception your docling_graph version raises
        # Fall back to VLM for better accuracy
        config = PipelineConfig(
            source=document_path,
            template="templates.BillingDocument",
            backend="vlm",
            inference="local"
        )
        run_pipeline(config)

Backend-Specific Settings

LLM-Specific Settings

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",

    # LLM-specific
    use_chunking=True,  # Split large documents
    llm_consolidation=True,  # Merge results with LLM
    max_batch_size=5  # Process multiple chunks
)

VLM-Specific Settings

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",

    # VLM-specific
    docling_config="vision",  # Use vision pipeline
    processing_mode="one-to-one"  # Process pages individually
)

Common Questions

Q: Can I use VLM with remote inference?

A: No, VLM currently only supports local inference. Use LLM backend for remote API support.

Q: Which backend is more accurate?

A: VLM is generally more accurate for complex layouts and visual documents. LLM is more accurate for simple text documents.

Q: Which backend is faster?

A: LLM is faster, especially with remote APIs. VLM requires more processing time due to image analysis.

Q: Can I switch backends mid-project?

A: Yes, backends are interchangeable. Just change the backend parameter in your config.

Q: Do I need different templates for different backends?

A: No, the same Pydantic template works with both backends.


Next Steps

Now that you understand backend selection:

  1. Model Configuration - Configure models for your chosen backend
  2. Processing Modes - Choose processing strategy
  3. Configuration Examples - See complete scenarios