# Backend Selection: LLM vs VLM

## Overview

Docling Graph supports two extraction backends: **LLM** (Language Model) for text-based extraction and **VLM** (Vision-Language Model) for vision-based extraction. Choosing the right backend is crucial for extraction quality and performance.

In this guide:

- LLM vs VLM comparison
- When to use each backend
- Performance characteristics
- Cost considerations
- Switching between backends

## Backend Comparison

### Quick Comparison Table

| Aspect | LLM Backend | VLM Backend |
|---|---|---|
| Input | Markdown text | Document images |
| Best For | Text-heavy documents | Complex layouts, images |
| Inference | Local or Remote | Local only |
| Speed | Fast | Slower |
| Accuracy | High for text | Highest for complex layouts |
| GPU Required | Optional (remote) | Yes (local only) |
| Cost | Low (local) to Medium (remote) | Medium (GPU required) |
| Setup | Easy | Moderate |

## LLM Backend

### What is the LLM Backend?

The LLM backend uses language models to extract structured data from markdown text. Documents are first converted to markdown using Docling, then processed by the LLM.
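
This first stage can be reproduced on its own with Docling. A minimal sketch, assuming the upstream `docling` package is installed (the snippet is illustrative and bypasses `docling_graph` entirely):

```python
from docling.document_converter import DocumentConverter

# Stage 1 of the LLM backend: convert the document to markdown.
converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown_text = result.document.export_to_markdown()

# Stage 2 (handled by docling_graph) feeds this markdown to the LLM,
# which fills your Pydantic template with structured data.
print(markdown_text[:500])
```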

### Architecture

```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
A@{ shape: terminal, label: "PDF Document" }
B@{ shape: procs, label: "Docling Conversion" }
C@{ shape: doc, label: "Markdown Text" }
D@{ shape: tag-proc, label: "Chunking Optional" }
E@{ shape: procs, label: "LLM Extraction" }
F@{ shape: doc, label: "Structured Data" }
%% 3. Define Connections
A --> B
B --> C
C --> D
D --> E
E --> F
%% 4. Apply Classes
class A input
class B,E process
class C data
class D operator
class F output
```

### Configuration

```python
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",      # LLM backend
    inference="local",  # or "remote"
)
```
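
Running the pipeline is then a single call, as in the examples later in this guide:

```python
run_pipeline(config)
```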

### When to Use LLM

✅ Use LLM when:

- Documents are primarily text-based
- Layout is standard (invoices, contracts, reports)
- You need remote API support
- Cost efficiency is important
- You want fast processing
- You don't have a GPU available (use remote)

❌ Don't use LLM when:

- Documents have complex visual layouts
- Images contain critical information
- Tables have complex structures
- Handwriting needs to be processed

### LLM Advantages

- **Flexible Inference**
    - Local: use your own GPU/CPU
    - Remote: use cloud APIs (OpenAI, Mistral, Gemini)
- **Fast Processing**
    - Quick markdown conversion
    - Efficient text processing
    - Parallel chunking support
- **Cost Effective**
    - Local inference: free (after GPU cost)
    - Remote inference: pay per token
    - Generally cheaper than VLM
- **Easy Setup**
    - No GPU required for remote
    - Simple API key configuration
    - Wide model selection
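
Remote setup is typically just an API key plus two config fields. A minimal sketch; the environment variable name is an assumption that depends on your provider, and the model id is taken from the switching example later in this guide:

```python
import os

from docling_graph import run_pipeline, PipelineConfig

# Assumption: remote inference reads the provider key from the environment.
os.environ.setdefault("OPENAI_API_KEY", "sk-...")

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
    model_override="gpt-4-turbo",  # specify the remote model
)
run_pipeline(config)
```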

### LLM Limitations

- **Text-Only Processing**
    - Loses visual information
    - May miss layout cues
    - Can't process images directly
- **OCR Dependency**
    - Relies on Docling OCR quality
    - May struggle with poor scans
    - Handwriting not well supported
- **Context Limits**
    - Large documents need chunking (see the sketch below)
    - May lose cross-page context
    - Requires consolidation for coherence
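
Chunking and consolidation are both controlled through the pipeline config; the same options are detailed under LLM-Specific Settings below:

```python
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="large_document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    use_chunking=True,       # split the large document into chunks
    llm_consolidation=True,  # merge per-chunk results with the LLM
)
```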

## VLM Backend

### What is the VLM Backend?

The VLM backend uses vision-language models to extract structured data directly from document images. It processes visual information alongside text, understanding layout and structure.

### Architecture

```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
InputPDF@{ shape: terminal, label: "PDF Document" }
InputImg@{ shape: terminal, label: "Images" }
Convert@{ shape: procs, label: "PDF to Image<br>Conversion" }
PageImgs@{ shape: doc, label: "Page Images" }
VLM@{ shape: procs, label: "VLM Processing" }
Understand@{ shape: lin-proc, label: "Visual Understanding" }
Extract@{ shape: tag-proc, label: "Direct Extraction" }
Output@{ shape: doc, label: "Pydantic Models" }
%% 3. Define Connections
%% Path A: PDF requires conversion
InputPDF --> Convert
Convert --> PageImgs
PageImgs --> VLM
%% Path B: Direct Image Input (Merges here)
InputImg --> VLM
%% Shared Processing Chain
VLM --> Understand
Understand --> Extract
Extract --> Output
%% 4. Apply Classes
class InputPDF,InputImg input
class Convert,VLM,Understand process
class PageImgs data
class Extract operator
class Output output
```

### Configuration

```python
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",            # VLM backend
    inference="local",        # VLM only supports local
    docling_config="vision",  # Optional: use vision pipeline
)
```
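
Because the VLM backend only runs locally, it is worth confirming a GPU is visible before launching a long job. A minimal check, assuming PyTorch is installed in the same environment:

```python
import torch

# Fail fast if no CUDA device is available for local VLM inference.
if not torch.cuda.is_available():
    raise RuntimeError("The VLM backend requires a local GPU")

run_pipeline(config)
```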

### When to Use VLM

✅ Use VLM when:

- Documents have complex visual layouts
- Images contain critical information
- Tables have intricate structures
- Forms have specific visual patterns
- Highest accuracy is required
- You have a GPU available

❌ Don't use VLM when:

- Documents are simple text
- You need remote API support
- A GPU is not available
- Processing speed is critical
- Cost is a major concern

### VLM Advantages

- **Visual Understanding**
    - Processes layout and structure
    - Understands visual relationships
    - Handles complex tables
    - Processes embedded images
- **Higher Accuracy**
    - Best for complex documents
    - Understands visual context
    - Fewer extraction errors
    - Better table handling
- **No OCR Dependency**
    - Direct image processing
    - Better with poor scans
    - Handles handwriting better
    - Preserves visual information

### VLM Limitations

- **Local Only**
    - Requires a local GPU
    - No remote API support
    - Higher setup complexity
    - GPU memory requirements
- **Slower Processing**
    - Image processing overhead
    - Larger model size
    - More GPU memory needed
    - Longer inference time
- **Higher Cost**
    - GPU required
    - More expensive hardware
    - Higher power consumption
    - Larger storage needs

## Decision Matrix

### By Document Type

| Document Type | Recommended Backend | Reason |
|---|---|---|
| Invoices | LLM | Standard layout, text-heavy |
| Contracts | LLM | Text-heavy, standard format |
| Research Papers | LLM | Text-heavy, standard layout |
| Forms | VLM | Visual structure important |
| ID Cards | VLM | Visual layout critical |
| Complex Tables | VLM | Visual structure needed |
| Handwritten | VLM | Visual processing required |
| Mixed Content | VLM | Images and text combined |

### By Infrastructure

| Infrastructure | Recommended Backend | Configuration |
|---|---|---|
| No GPU | LLM Remote | `backend="llm", inference="remote"` |
| CPU Only | LLM Remote | `backend="llm", inference="remote"` |
| GPU Available | LLM or VLM Local | `backend="llm"` or `backend="vlm"`, with `inference="local"` |
| Cloud/API | LLM Remote | `backend="llm", inference="remote"` |

### By Priority

| Priority | Recommended Backend | Reason |
|---|---|---|
| Speed | LLM | Faster processing |
| Accuracy | VLM | Better visual understanding |
| Cost | LLM Local | No API costs |
| Simplicity | LLM Remote | Easy setup |
| Offline | LLM or VLM Local | No internet needed |
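
The three matrices translate directly into a small selection helper. A hypothetical sketch; the helper and its arguments are illustrative, not part of the library:

```python
from docling_graph import PipelineConfig

def choose_config(source: str, template: str,
                  has_gpu: bool, complex_layout: bool) -> PipelineConfig:
    """Illustrative helper applying the decision matrices above."""
    if complex_layout and has_gpu:
        # Complex visual layouts with a GPU available: VLM local.
        return PipelineConfig(source=source, template=template,
                              backend="vlm", inference="local")
    # Text-heavy documents, or no GPU: LLM, remote when no GPU is present.
    return PipelineConfig(source=source, template=template,
                          backend="llm",
                          inference="local" if has_gpu else "remote")
```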

## Performance Comparison

### Processing Speed

```text
Document: 10-page invoice PDF

LLM Local (GPU):   ~30 seconds
LLM Remote (API):  ~45 seconds
VLM Local (GPU):   ~90 seconds
```

### Accuracy Comparison

```text
Document Type: Complex invoice with tables

LLM Accuracy: 92% field extraction
VLM Accuracy: 97% field extraction

Document Type: Simple text contract

LLM Accuracy: 98% field extraction
VLM Accuracy: 96% field extraction
```

### Cost Comparison

```text
Processing 1000 documents:

LLM Local:  $0 (GPU amortized)
LLM Remote: $50-200 (API costs)
VLM Local:  $0 (GPU amortized)
VLM Remote: Not available
```

## Switching Between Backends

### From LLM to VLM

```python
from docling_graph import PipelineConfig

# Original LLM config
config_llm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
)

# Switch to VLM
config_vlm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",            # Change backend
    inference="local",        # Must be local for VLM
    docling_config="vision",  # Optional: use vision pipeline
)
```

### From VLM to LLM

```python
from docling_graph import PipelineConfig

# Original VLM config
config_vlm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",
    inference="local",
)

# Switch to LLM
config_llm = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",                 # Change backend
    inference="remote",            # Can now use remote
    model_override="gpt-4-turbo",  # Specify model
)
```

## Hybrid Approach

### Strategy 1: Document Type Based

```python
from docling_graph import PipelineConfig

def get_config(document_path: str, document_type: str) -> PipelineConfig:
    """Choose a backend based on document type."""
    if document_type in ["invoice", "contract", "report"]:
        # Use LLM for text-heavy documents
        return PipelineConfig(
            source=document_path,
            template="templates.BillingDocument",
            backend="llm",
            inference="remote",
        )
    else:
        # Use VLM for complex layouts
        return PipelineConfig(
            source=document_path,
            template="templates.Form",
            backend="vlm",
            inference="local",
        )
```

### Strategy 2: Fallback Pattern

```python
from docling_graph import run_pipeline, PipelineConfig
# Assumption: ExtractionError is the pipeline's extraction failure type;
# adjust this import to wherever your docling_graph version exposes it.
from docling_graph import ExtractionError

def extract_with_fallback(document_path: str) -> None:
    """Try LLM first, fall back to VLM if needed."""
    try:
        # Try LLM first (faster)
        config = PipelineConfig(
            source=document_path,
            template="templates.BillingDocument",
            backend="llm",
            inference="remote",
        )
        run_pipeline(config)
    except ExtractionError:
        # Fall back to VLM for better accuracy
        config = PipelineConfig(
            source=document_path,
            template="templates.BillingDocument",
            backend="vlm",
            inference="local",
        )
        run_pipeline(config)
```
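
This pattern trades an occasional second pass for lower average cost: most documents succeed on the fast LLM path, and only failures pay the GPU-bound VLM price. Usage is a single call:

```python
extract_with_fallback("document.pdf")
```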

## Backend-Specific Settings

### LLM-Specific Settings

```python
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    # LLM-specific
    use_chunking=True,       # Split large documents
    llm_consolidation=True,  # Merge results with LLM
    max_batch_size=5,        # Process multiple chunks per batch
)
```

### VLM-Specific Settings

```python
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",
    # VLM-specific
    docling_config="vision",       # Use vision pipeline
    processing_mode="one-to-one",  # Process pages individually
)
```

## Common Questions

**Q: Can I use VLM with remote inference?**

A: No, VLM currently only supports local inference. Use the LLM backend for remote API support.

**Q: Which backend is more accurate?**

A: VLM is generally more accurate for complex layouts and visual documents. LLM is more accurate for simple text documents.

**Q: Which backend is faster?**

A: LLM is faster, especially with remote APIs. VLM requires more processing time due to image analysis.

**Q: Can I switch backends mid-project?**

A: Yes, backends are interchangeable. Just change the `backend` parameter in your config.

**Q: Do I need different templates for different backends?**

A: No, the same Pydantic template works with both backends.
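
Because one template serves both backends, switching costs nothing on the schema side. A sketch of what such a template might look like; the class and field names are illustrative, not the project's actual `templates.BillingDocument`:

```python
from pydantic import BaseModel

class BillingDocument(BaseModel):
    """Illustrative extraction template shared by both backends."""
    invoice_number: str
    total_amount: float
    currency: str
```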

## Next Steps

Now that you understand backend selection:

- Model Configuration - Configure models for your chosen backend
- Processing Modes - Choose processing strategy
- Configuration Examples - See complete scenarios