Extraction Backends¶
Overview¶
Extraction backends are the engines that extract structured data from documents. Docling Graph supports two types: LLM backends (text-based) and VLM backends (vision-based).
In this guide:

- LLM vs VLM comparison
- Backend selection criteria
- Configuration and usage
- Model capability tiers
- Performance optimization
- Error handling
New: Model Capability Detection
Docling Graph now automatically detects model capabilities and adapts prompts and consolidation strategies based on model size. See Model Capabilities for details.
Backend Types¶
Quick Comparison¶
| Feature | LLM Backend | VLM Backend |
|---|---|---|
| Input | Markdown text | Images/PDFs directly |
| Processing | Text-based | Vision-based |
| Accuracy | High for text | High for visuals |
| Speed | Fast | Slower |
| Cost | Low (local) / Medium (API) | Medium |
| GPU | Optional | Recommended |
| Best For | Standard documents | Complex layouts |
LLM Backend¶
What is LLM Backend?¶
The LLM (Language Model) backend processes documents as text, using markdown extracted from PDFs. It supports both local and remote models.
Architecture¶
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
A@{ shape: terminal, label: "PDF Document" }
B@{ shape: procs, label: "Docling Conversion" }
C@{ shape: doc, label: "Markdown Text" }
    D@{ shape: tag-proc, label: "Chunking (Optional)" }
E@{ shape: procs, label: "LLM Extraction" }
F@{ shape: doc, label: "Structured Data" }
%% 3. Define Connections
A --> B
B --> C
C --> D
D --> E
E --> F
%% 4. Apply Classes
class A input
class B,E process
class C data
class D operator
class F output
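To see what the first two stages produce, the DocumentProcessor used later in this guide (see Troubleshooting) can be called directly. This is a minimal sketch of the conversion and markdown steps; chunking and extraction are covered in the usage sections below.
from docling_graph.core.extractors import DocumentProcessor

# Docling Conversion: PDF -> DoclingDocument
processor = DocumentProcessor()
document = processor.convert_to_docling_doc("document.pdf")

# Markdown Text: this is what the LLM backend actually receives
markdown = processor.extract_full_markdown(document)
print(markdown[:500])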
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
backend="llm", # LLM backend
inference="local", # or "remote"
provider_override="ollama",
model_override="llama3.1:8b"
)
Model Capability Detection¶
Docling Graph automatically detects model capabilities based on parameter count and adapts its behavior:
from docling_graph import run_pipeline, PipelineConfig
# Small model (1B-7B) - Uses SIMPLE tier
config = PipelineConfig(
backend="llm",
inference="local",
provider_override="ollama",
model_override="llama3.2:3b" # Automatically detected as SIMPLE
)
# Medium model (7B-13B) - Uses STANDARD tier
config = PipelineConfig(
backend="llm",
inference="local",
provider_override="ollama",
model_override="llama3.1:8b" # Automatically detected as STANDARD
)
# Large model (13B+) - Uses ADVANCED tier
config = PipelineConfig(
backend="llm",
inference="remote",
provider_override="openai",
model_override="gpt-4-turbo" # Automatically detected as ADVANCED
)
Capability Tiers:
| Tier | Model Size | Prompt Style | Consolidation |
|---|---|---|---|
| SIMPLE | 1B-7B | Minimal instructions | Basic merge |
| STANDARD | 7B-13B | Balanced instructions | Standard merge |
| ADVANCED | 13B+ | Detailed instructions | Chain of Density |
See Model Capabilities for complete details.
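The boundaries in the table amount to a simple lookup on parameter count. The helper below is purely illustrative and does not reproduce Docling Graph's internal detection logic:
def capability_tier(params_billions: float) -> str:
    """Illustrative tier lookup matching the table above."""
    if params_billions < 7:
        return "SIMPLE"     # minimal instructions, basic merge
    if params_billions < 13:
        return "STANDARD"   # balanced instructions, standard merge
    return "ADVANCED"       # detailed instructions, Chain of Density

print(capability_tier(3))    # SIMPLE   (e.g. llama3.2:3b)
print(capability_tier(8))    # STANDARD (e.g. llama3.1:8b)
print(capability_tier(70))   # ADVANCED (e.g. large remote models)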
LLM Backend Features¶
✅ Strengths¶
- Fast Processing
  - Quick text extraction
  - Efficient chunking
  - Parallel processing
- Cost Effective
  - Local models are free
  - Remote APIs are affordable
  - No GPU required (local)
- Flexible
  - Multiple providers
  - Easy to switch models
  - API or local
- Accurate for Text
  - Excellent for standard documents
  - Good table understanding
  - Strong reasoning
❌ Limitations¶
- Text-Only
  - No visual understanding
  - Relies on OCR quality
  - May miss layout cues
- Context Limits
  - Requires chunking for large docs
  - May lose cross-page context
  - Needs merging
Supported Providers¶
Local Providers¶
Ollama:
config = PipelineConfig(
backend="llm",
inference="local",
provider_override="ollama",
model_override="llama3.1:8b"
)
vLLM:
config = PipelineConfig(
backend="llm",
inference="local",
provider_override="vllm",
model_override="ibm-granite/granite-4.0-1b"
)
Remote Providers¶
Mistral AI:
config = PipelineConfig(
backend="llm",
inference="remote",
provider_override="mistral",
model_override="mistral-large-latest"
)
OpenAI:
config = PipelineConfig(
backend="llm",
inference="remote",
provider_override="openai",
model_override="gpt-4-turbo"
)
Google Gemini:
config = PipelineConfig(
backend="llm",
inference="remote",
provider_override="gemini",
model_override="gemini-2.5-flash"
)
IBM watsonx:
config = PipelineConfig(
backend="llm",
inference="remote",
provider_override="watsonx",
model_override="ibm/granite-13b-chat-v2"
)
LLM Backend Usage¶
Basic Extraction¶
from docling_graph.core.extractors.backends import LlmBackend
from docling_graph.llm_clients import get_client
from docling_graph.llm_clients.config import resolve_effective_model_config
# Initialize client
effective = resolve_effective_model_config("ollama", "llama3.1:8b")
client = get_client("ollama")(model_config=effective)
# Create backend
backend = LlmBackend(llm_client=client)
# Extract from markdown
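# Note: InvoiceTemplate is assumed to be a user-defined Pydantic extraction template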
model = backend.extract_from_markdown(
markdown="# BillingDocument\n\nInvoice Number: INV-001\nTotal: $1000",
template=InvoiceTemplate,
context="full document",
is_partial=False
)
print(model)
With Consolidation¶
# Extract from multiple chunks ("chunks" is a list of markdown strings from the chunker)
models = []
for i, chunk in enumerate(chunks):
model = backend.extract_from_markdown(
markdown=chunk,
template=InvoiceTemplate,
context=f"chunk {i}",
is_partial=True
)
if model:
models.append(model)
# Consolidate with LLM
from docling_graph.core.utils import merge_pydantic_models
programmatic_merge = merge_pydantic_models(models, InvoiceTemplate)
final_model = backend.consolidate_from_pydantic_models(
raw_models=models,
programmatic_model=programmatic_merge,
template=InvoiceTemplate
)
Chain of Density Consolidation
For ADVANCED tier models (13B+), consolidation uses a multi-turn "Chain of Density" approach:
1. Initial Merge: Create first consolidated version
2. Refinement: Identify and resolve conflicts
3. Final Polish: Ensure completeness and accuracy
This produces higher quality results but uses more tokens. See Model Capabilities.
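Conceptually, the three turns look like the sketch below. The chat helper is hypothetical; it only illustrates the passes Docling Graph performs internally for ADVANCED-tier models.
import json
from typing import Callable

def chain_of_density_merge(chat: Callable[[str], str], chunk_jsons: list[str]) -> dict:
    """Conceptual sketch: `chat` is a hypothetical LLM call that returns JSON text."""
    sources = "\n".join(chunk_jsons)
    # 1. Initial Merge: create the first consolidated version
    draft = chat(f"Merge these partial extractions into a single JSON record:\n{sources}")
    # 2. Refinement: identify and resolve conflicts against the sources
    refined = chat(f"Resolve conflicting fields in this draft using the sources.\nDraft:\n{draft}\nSources:\n{sources}")
    # 3. Final Polish: ensure completeness and accuracy
    final = chat(f"Check this record for completeness and accuracy; return the final JSON only:\n{refined}")
    return json.loads(final)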
VLM Backend¶
What is VLM Backend?¶
The VLM (Vision-Language Model) backend processes documents visually, understanding layout, images, and text together like a human would.
Architecture¶
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
InputPDF@{ shape: terminal, label: "PDF Document" }
InputImg@{ shape: terminal, label: "Images" }
Convert@{ shape: procs, label: "PDF to Image<br>Conversion" }
PageImgs@{ shape: doc, label: "Page Images" }
VLM@{ shape: procs, label: "VLM Processing" }
Understand@{ shape: lin-proc, label: "Visual Understanding" }
Extract@{ shape: tag-proc, label: "Direct Extraction" }
Output@{ shape: doc, label: "Pydantic Models" }
%% 3. Define Connections
%% Path A: PDF requires conversion
InputPDF --> Convert
Convert --> PageImgs
PageImgs --> VLM
%% Path B: Direct Image Input (Merges here)
InputImg --> VLM
%% Shared Processing Chain
VLM --> Understand
Understand --> Extract
Extract --> Output
%% 4. Apply Classes
class InputPDF,InputImg input
class Convert,VLM,Understand process
class PageImgs data
class Extract operator
class Output output
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
backend="vlm", # VLM backend
inference="local", # Only local supported
model_override="numind/NuExtract-2.0-8B"
)
VLM Backend Features¶
✅ Strengths¶
- Visual Understanding
  - Sees layout and structure
  - Understands images
  - Handles complex formats
- No Chunking Needed
  - Processes pages directly
  - No context window limits
  - Simpler pipeline
- Robust to OCR Issues
  - Doesn't rely on OCR
  - Handles poor quality
  - Better for handwriting
- Layout Aware
  - Understands visual hierarchy
  - Recognizes forms
  - Detects tables visually
❌ Limitations¶
- Slower
  - More computation
  - GPU recommended
  - Longer processing time
- Local Only
  - No remote API support
  - Requires local GPU
  - Higher resource usage
- Model Size
  - Large models (2B-8B params)
  - More memory needed
  - Longer startup time
Supported Models¶
NuExtract 2.0 (Recommended):
# 2B model (faster, less accurate)
model_override="numind/NuExtract-2.0-2B"
# 8B model (slower, more accurate)
model_override="numind/NuExtract-2.0-8B"
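A simple way to pick between the two sizes is to check available GPU memory first. The sketch below is a rough heuristic, and the ~20 GB cutoff is an assumption, not a documented requirement:
import torch

def pick_nuextract_model() -> str:
    """Rough heuristic: the 8B model needs considerably more VRAM than the 2B one."""
    if torch.cuda.is_available():
        free_bytes, _total = torch.cuda.mem_get_info()
        if free_bytes > 20 * 1024**3:   # assumption: ~20 GB free for the 8B model
            return "numind/NuExtract-2.0-8B"
    return "numind/NuExtract-2.0-2B"

print(pick_nuextract_model())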
VLM Backend Usage¶
Basic Extraction¶
from docling_graph.core.extractors.backends import VlmBackend
# Initialize backend
backend = VlmBackend(model_name="numind/NuExtract-2.0-8B")
# Extract from document
models = backend.extract_from_document(
source="document.pdf",
template=InvoiceTemplate
)
print(f"Extracted {len(models)} models")
With Pipeline¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="complex_form.pdf",
template="templates.ApplicationForm",
backend="vlm",
inference="local",
processing_mode="one-to-one" # One model per page
)
run_pipeline(config)
Backend Selection¶
LLM Backend Criteria¶
- Document is text-heavy
- Need fast processing
- Want to use remote APIs
- Processing many documents
- Standard layout
- Good OCR quality
VLM Backend Criteria¶
- Complex visual layout
- Poor OCR quality
- Handwritten content
- Image-heavy documents
- Form-based extraction
- Have GPU available
Complete Examples¶
📍 LLM Backend (Local)¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="invoice.pdf",
template="templates.BillingDocument",
# LLM backend with Ollama
backend="llm",
inference="local",
provider_override="ollama",
model_override="llama3.1:8b",
# Optimized settings
use_chunking=True,
processing_mode="many-to-one",
output_dir="outputs/llm_local"
)
run_pipeline(config)
📍 LLM Backend (Remote)¶
from docling_graph import run_pipeline, PipelineConfig
import os
# Set API key
os.environ["MISTRAL_API_KEY"] = "your_api_key"
config = PipelineConfig(
source="contract.pdf",
template="templates.Contract",
# LLM backend with Mistral API
backend="llm",
inference="remote",
provider_override="mistral",
model_override="mistral-large-latest",
# High accuracy settings
use_chunking=True,
llm_consolidation=True,
processing_mode="many-to-one",
output_dir="outputs/llm_remote"
)
run_pipeline(config)
📍 VLM Backend¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="complex_form.pdf",
template="templates.ApplicationForm",
# VLM backend
backend="vlm",
inference="local",
model_override="numind/NuExtract-2.0-8B",
# VLM settings
processing_mode="one-to-one", # One model per page
docling_config="vision", # Vision pipeline
use_chunking=False, # VLM doesn't need chunking
output_dir="outputs/vlm"
)
run_pipeline(config)
📍 Hybrid Approach¶
from docling_graph import run_pipeline, PipelineConfig
def process_document(doc_path: str, doc_type: str):
"""Process document with appropriate backend."""
if doc_type == "form":
# Use VLM for forms
backend = "vlm"
inference = "local"
processing_mode = "one-to-one"
else:
# Use LLM for standard docs
backend = "llm"
inference = "remote"
processing_mode = "many-to-one"
config = PipelineConfig(
source=doc_path,
template=f"templates.{doc_type.capitalize()}",
backend=backend,
inference=inference,
processing_mode=processing_mode
)
run_pipeline(config)
# Process different document types
process_document("invoice.pdf", "invoice") # LLM
process_document("form.pdf", "form") # VLM
Error Handling¶
LLM Backend Errors¶
from docling_graph.exceptions import ExtractionError
try:
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
backend="llm",
inference="remote"
)
run_pipeline(config)
except ExtractionError as e:
print(f"Extraction failed: {e.message}")
print(f"Details: {e.details}")
# Fallback to local
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
backend="llm",
inference="local"
)
run_pipeline(config)
VLM Backend Errors¶
from docling_graph.exceptions import ExtractionError
try:
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
backend="vlm"
)
run_pipeline(config)
except ExtractionError as e:
print(f"VLM extraction failed: {e.message}")
# Fallback to LLM
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
backend="llm",
inference="local"
)
run_pipeline(config)
Best Practices¶
👍 Match Backend to Document Type¶
# ✅ Good - Choose based on document
if document_is_form:
backend = "vlm"
elif document_is_standard:
backend = "llm"
👍 Use Local for Development¶
# ✅ Good - Fast iteration
config = PipelineConfig(
source="test.pdf",
template="templates.BillingDocument",
backend="llm",
inference="local" # Fast for testing
)
👍 Use Remote for Production¶
# ✅ Good - Reliable and scalable
config = PipelineConfig(
source="production.pdf",
template="templates.BillingDocument",
backend="llm",
inference="remote" # Reliable
)
👍 Cleanup Resources¶
# ✅ Good - Always cleanup
from docling_graph.core.extractors.backends import VlmBackend
backend = VlmBackend(model_name="numind/NuExtract-2.0-8B")
try:
models = backend.extract_from_document(source, template)
finally:
backend.cleanup() # Free GPU memory
Enhanced GPU Cleanup
The VLM backend now includes enhanced GPU memory management:
- Model-to-CPU Transfer: Moves model to CPU before deletion
- CUDA Cache Clearing: Explicitly clears GPU cache
- Memory Tracking: Logs memory usage before/after cleanup
- Multi-GPU Support: Handles multiple GPU devices
This ensures GPU memory is properly released, which is especially important for long-running processes.
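For reference, the same steps for a manually managed PyTorch model look roughly like this. This is a generic sketch, not Docling Graph's internal implementation:
import gc
import torch

def release_model(model) -> None:
    """Free GPU memory held by a PyTorch model (the caller should drop its own references too)."""
    before = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0
    model.to("cpu")   # move weights off the GPU before deletion
    del model
    gc.collect()      # drop remaining Python references
    if torch.cuda.is_available():
        for device in range(torch.cuda.device_count()):
            with torch.cuda.device(device):
                torch.cuda.empty_cache()   # explicitly clear each device's CUDA cache
        after = torch.cuda.memory_allocated()
        print(f"GPU memory: {before / 1e9:.2f} GB -> {after / 1e9:.2f} GB")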
👍 Use Real Tokenizers¶
# ✅ Good - Accurate token counting
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
backend="llm",
inference="local",
provider_override="ollama",
model_override="llama3.1:8b",
use_chunking=True # Uses real tokenizer with 20% safety margin
)
Benefits:

- Prevents context window overflows
- More efficient chunk packing
- Better resource utilization
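To check chunk sizes yourself, the budget can be computed with a real tokenizer and the same 20% safety margin. The tokenizer and context size below are illustrative stand-ins, not Docling Graph defaults:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in; use your actual model's tokenizer
context_window = 8192                               # illustrative context size
budget = int(context_window * 0.8)                  # keep a 20% safety margin

chunk = "Invoice Number: INV-001\nTotal: $1000"
n_tokens = len(tokenizer.encode(chunk))
print(f"{n_tokens} tokens used of a {budget}-token budget")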
Troubleshooting¶
🐛 LLM Returns Empty Results¶
Solution:
# Check markdown extraction
from docling_graph.core.extractors import DocumentProcessor
processor = DocumentProcessor()
document = processor.convert_to_docling_doc("document.pdf")
markdown = processor.extract_full_markdown(document)
if not markdown.strip():
print("Markdown extraction failed")
🐛 VLM Out of Memory¶
Solution:
# Use smaller model
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
backend="vlm",
model_override="numind/NuExtract-2.0-2B" # Smaller model
)
🐛 Slow VLM Processing¶
Solution:
# Switch to LLM for speed
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
backend="llm", # Faster
inference="local"
)
Advanced Features¶
Provider-Specific Batching¶
Different LLM providers have different optimal batching strategies:
from docling_graph import run_pipeline, PipelineConfig
# OpenAI - Uses default 95% threshold
config = PipelineConfig(
backend="llm",
inference="remote",
provider_override="openai",
model_override="gpt-4-turbo",
use_chunking=True # Automatically uses 95% threshold (default)
)
# Anthropic - Uses default 95% threshold
config = PipelineConfig(
backend="llm",
inference="remote",
provider_override="anthropic",
model_override="claude-3-opus",
use_chunking=True # Automatically uses 95% threshold (default)
)
# Ollama - Uses default 95% threshold
config = PipelineConfig(
backend="llm",
inference="local",
provider_override="ollama",
model_override="llama3.1:8b",
use_chunking=True # Automatically uses 95% threshold (default)
)
Why Different Thresholds?

- OpenAI/Google: Robust to near-limit contexts → aggressive batching
- Anthropic: More conservative → moderate batching
- Ollama/Local: Variable performance → conservative batching
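In practice, a threshold simply shrinks the usable context window before chunks are packed. The values in this sketch are hypothetical (the examples above all use the 95% default) and only illustrate the idea:
# Hypothetical per-provider thresholds, for illustration only
THRESHOLDS = {"openai": 0.95, "gemini": 0.95, "anthropic": 0.90, "ollama": 0.85}

def usable_context(provider: str, context_window: int) -> int:
    """Tokens available for packed chunks after applying the batching threshold."""
    return int(context_window * THRESHOLDS.get(provider, 0.85))

print(usable_context("openai", 128_000))   # 121600
print(usable_context("ollama", 8_192))     # 6963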
Next Steps¶
Now that you understand extraction backends:
- Model Capabilities → Learn about adaptive prompting
- Model Merging → Learn how to consolidate extractions
- Batch Processing → Optimize chunk processing
- Performance Tuning → Advanced optimization