
Extraction Backends

Overview

Extraction backends are the engines that extract structured data from documents. Docling Graph supports two types: LLM backends (text-based) and VLM backends (vision-based).

In this guide:

  • LLM vs VLM comparison
  • Backend selection criteria
  • Configuration and usage
  • Model capability tiers
  • Performance optimization
  • Error handling

New: Model Capability Detection

Docling Graph now automatically detects model capabilities and adapts prompts and consolidation strategies based on model size. See Model Capabilities for details.


Backend Types

Quick Comparison

| Feature    | LLM Backend                | VLM Backend          |
|------------|----------------------------|----------------------|
| Input      | Markdown text              | Images/PDFs directly |
| Processing | Text-based                 | Vision-based         |
| Accuracy   | High for text              | High for visuals     |
| Speed      | Fast                       | Slower               |
| Cost       | Low (local) / Medium (API) | Medium               |
| GPU        | Optional                   | Recommended          |
| Best For   | Standard documents         | Complex layouts      |

LLM Backend

What is LLM Backend?

The LLM (Language Model) backend processes documents as text, using markdown extracted from PDFs. It supports both local and remote models.

Architecture

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "PDF Document" }

    B@{ shape: procs, label: "Docling Conversion" }
    C@{ shape: doc, label: "Markdown Text" }
    D@{ shape: tag-proc, label: "Chunking (Optional)" }
    E@{ shape: procs, label: "LLM Extraction" }

    F@{ shape: doc, label: "Structured Data" }

    %% 3. Define Connections
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F

    %% 4. Apply Classes
    class A input
    class B,E process
    class C data
    class D operator
    class F output

Configuration

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",           # LLM backend
    inference="local",       # or "remote"
    provider_override="ollama",
    model_override="llama3.1:8b"
)

Model Capability Detection

Docling Graph automatically detects model capabilities based on parameter count and adapts its behavior:

from docling_graph import run_pipeline, PipelineConfig

# Small model (1B-7B) - Uses SIMPLE tier
config = PipelineConfig(
    backend="llm",
    inference="local",
    provider_override="ollama",
    model_override="llama3.2:3b"  # Automatically detected as SIMPLE
)

# Medium model (7B-13B) - Uses STANDARD tier
config = PipelineConfig(
    backend="llm",
    inference="local",
    provider_override="ollama",
    model_override="llama3.1:8b"  # Automatically detected as STANDARD
)

# Large model (13B+) - Uses ADVANCED tier
config = PipelineConfig(
    backend="llm",
    inference="remote",
    provider_override="openai",
    model_override="gpt-4-turbo"  # Automatically detected as ADVANCED
)

Capability Tiers:

| Tier     | Model Size | Prompt Style          | Consolidation    |
|----------|------------|-----------------------|------------------|
| SIMPLE   | 1B-7B      | Minimal instructions  | Basic merge      |
| STANDARD | 7B-13B     | Balanced instructions | Standard merge   |
| ADVANCED | 13B+       | Detailed instructions | Chain of Density |

See Model Capabilities for complete details.
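
The tier boundaries amount to a simple lookup on parameter count. A minimal sketch of the idea (the capability_tier helper and the exact cutoffs are illustrative, not part of the library's API):

# Illustrative mapping of parameter count (in billions) to capability tier.
def capability_tier(params_billion: float) -> str:
    """Map a model's parameter count to a capability tier."""
    if params_billion < 7:
        return "SIMPLE"      # minimal instructions, basic merge
    if params_billion < 13:
        return "STANDARD"    # balanced instructions, standard merge
    return "ADVANCED"        # detailed instructions, Chain of Density

print(capability_tier(3))    # SIMPLE   (e.g. llama3.2:3b)
print(capability_tier(8))    # STANDARD (e.g. llama3.1:8b)
print(capability_tier(70))   # ADVANCED (e.g. large remote models)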


LLM Backend Features

✅ Strengths

  1. Fast Processing
     • Quick text extraction
     • Efficient chunking
     • Parallel processing

  2. Cost Effective
     • Local models are free
     • Remote APIs are affordable
     • No GPU required (local)

  3. Flexible
     • Multiple providers
     • Easy to switch models
     • API or local

  4. Accurate for Text
     • Excellent for standard documents
     • Good table understanding
     • Strong reasoning

❌ Limitations

  1. Text-Only
     • No visual understanding
     • Relies on OCR quality
     • May miss layout cues

  2. Context Limits
     • Requires chunking for large docs
     • May lose cross-page context
     • Needs merging

Supported Providers

Local Providers

Ollama:

config = PipelineConfig(
    backend="llm",
    inference="local",
    provider_override="ollama",
    model_override="llama3.1:8b"
)

vLLM:

config = PipelineConfig(
    backend="llm",
    inference="local",
    provider_override="vllm",
    model_override="ibm-granite/granite-4.0-1b"
)

Remote Providers

Mistral AI:

config = PipelineConfig(
    backend="llm",
    inference="remote",
    provider_override="mistral",
    model_override="mistral-large-latest"
)

OpenAI:

config = PipelineConfig(
    backend="llm",
    inference="remote",
    provider_override="openai",
    model_override="gpt-4-turbo"
)

Google Gemini:

config = PipelineConfig(
    backend="llm",
    inference="remote",
    provider_override="gemini",
    model_override="gemini-2.5-flash"
)

IBM watsonx:

config = PipelineConfig(
    backend="llm",
    inference="remote",
    provider_override="watsonx",
    model_override="ibm/granite-13b-chat-v2"
)


LLM Backend Usage

Basic Extraction

from docling_graph.core.extractors.backends import LlmBackend
from docling_graph.llm_clients import get_client
from docling_graph.llm_clients.config import resolve_effective_model_config

# Initialize client
effective = resolve_effective_model_config("ollama", "llama3.1:8b")
client = get_client("ollama")(model_config=effective)

# Create backend
backend = LlmBackend(llm_client=client)

# Extract from markdown
model = backend.extract_from_markdown(
    markdown="# BillingDocument\n\nInvoice Number: INV-001\nTotal: $1000",
    template=InvoiceTemplate,
    context="full document",
    is_partial=False
)

print(model)

With Consolidation

# Extract from multiple chunks
models = []
for i, chunk in enumerate(chunks):
    model = backend.extract_from_markdown(
        markdown=chunk,
        template=InvoiceTemplate,
        context=f"chunk {i}",
        is_partial=True
    )
    if model:
        models.append(model)

# Consolidate with LLM
from docling_graph.core.utils import merge_pydantic_models

programmatic_merge = merge_pydantic_models(models, InvoiceTemplate)

final_model = backend.consolidate_from_pydantic_models(
    raw_models=models,
    programmatic_model=programmatic_merge,
    template=InvoiceTemplate
)

Chain of Density Consolidation

For ADVANCED tier models (13B+), consolidation uses a multi-turn "Chain of Density" approach:

  1. Initial Merge: Create first consolidated version
  2. Refinement: Identify and resolve conflicts
  3. Final Polish: Ensure completeness and accuracy

This produces higher quality results but uses more tokens. See Model Capabilities.
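
Conceptually, the three turns chain together as in the sketch below. The llm_merge and llm_refine helpers are placeholders for the internal prompt calls, not public API:

# Conceptual sketch of Chain of Density consolidation (placeholder helpers).
def chain_of_density(models, template, llm_merge, llm_refine):
    """Multi-turn consolidation: merge once, then iteratively refine."""
    # 1. Initial Merge: first consolidated version from all partial models
    draft = llm_merge(models, template)
    # 2. Refinement: identify and resolve conflicts against the source models
    draft = llm_refine(draft, models, template, focus="conflicts")
    # 3. Final Polish: ensure completeness and accuracy against the template
    return llm_refine(draft, models, template, focus="completeness")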


VLM Backend

What is VLM Backend?

The VLM (Vision-Language Model) backend processes documents visually, understanding layout, images, and text together like a human would.

Architecture

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    InputPDF@{ shape: terminal, label: "PDF Document" }
    InputImg@{ shape: terminal, label: "Images" }

    Convert@{ shape: procs, label: "PDF to Image<br>Conversion" }
    PageImgs@{ shape: doc, label: "Page Images" }

    VLM@{ shape: procs, label: "VLM Processing" }
    Understand@{ shape: lin-proc, label: "Visual Understanding" }
    Extract@{ shape: tag-proc, label: "Direct Extraction" }

    Output@{ shape: doc, label: "Pydantic Models" }

    %% 3. Define Connections
    %% Path A: PDF requires conversion
    InputPDF --> Convert
    Convert --> PageImgs
    PageImgs --> VLM

    %% Path B: Direct Image Input (Merges here)
    InputImg --> VLM

    %% Shared Processing Chain
    VLM --> Understand
    Understand --> Extract
    Extract --> Output

    %% 4. Apply Classes
    class InputPDF,InputImg input
    class Convert,VLM,Understand process
    class PageImgs data
    class Extract operator
    class Output output

Configuration

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",                      # VLM backend
    inference="local",                  # Only local supported
    model_override="numind/NuExtract-2.0-8B"
)

VLM Backend Features

✅ Strengths

  1. Visual Understanding
     • Sees layout and structure
     • Understands images
     • Handles complex formats

  2. No Chunking Needed
     • Processes pages directly
     • No context window limits
     • Simpler pipeline

  3. Robust to OCR Issues
     • Doesn't rely on OCR
     • Handles poor quality
     • Better for handwriting

  4. Layout Aware
     • Understands visual hierarchy
     • Recognizes forms
     • Detects tables visually

❌ Limitations

  1. Slower
     • More computation
     • GPU recommended
     • Longer processing time

  2. Local Only
     • No remote API support
     • Requires local GPU
     • Higher resource usage

  3. Model Size
     • Large models (2B-8B params)
     • More memory needed
     • Longer startup time

Supported Models

NuExtract 2.0 (Recommended):

# 2B model (faster, less accurate)
model_override="numind/NuExtract-2.0-2B"

# 8B model (slower, more accurate)
model_override="numind/NuExtract-2.0-8B"
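
If you are unsure which checkpoint fits your hardware, a small helper can pick one from available GPU memory. A rough sketch (the 24 GB cutoff is an assumption, not an official requirement):

import torch

def pick_nuextract_model() -> str:
    """Choose a NuExtract checkpoint from available GPU memory (rough heuristic)."""
    if torch.cuda.is_available():
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if total_gb >= 24:  # assumed cutoff for comfortably running the 8B model
            return "numind/NuExtract-2.0-8B"
    return "numind/NuExtract-2.0-2B"  # safer default for small GPUs or CPU

model_override = pick_nuextract_model()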


VLM Backend Usage

Basic Extraction

from docling_graph.core.extractors.backends import VlmBackend

# Initialize backend
backend = VlmBackend(model_name="numind/NuExtract-2.0-8B")

# Extract from document
models = backend.extract_from_document(
    source="document.pdf",
    template=InvoiceTemplate
)

print(f"Extracted {len(models)} models")

With Pipeline

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="complex_form.pdf",
    template="templates.ApplicationForm",
    backend="vlm",
    inference="local",
    processing_mode="one-to-one"  # One model per page
)

run_pipeline(config)

Backend Selection

LLM Backend Criteria

  • Document is text-heavy
  • Need fast processing
  • Want to use remote APIs
  • Processing many documents
  • Standard layout
  • Good OCR quality

VLM Backend Criteria

  • Complex visual layout
  • Poor OCR quality
  • Handwritten content
  • Image-heavy documents
  • Form-based extraction
  • Have GPU available
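
These criteria can be folded into a small helper. This is only a sketch of the decision logic; the flag names are illustrative:

def choose_backend(
    complex_layout: bool = False,
    poor_ocr: bool = False,
    handwritten: bool = False,
    has_gpu: bool = False,
) -> str:
    """Pick 'vlm' when visual understanding is needed and a GPU is available."""
    needs_vision = complex_layout or poor_ocr or handwritten
    if needs_vision and has_gpu:
        return "vlm"
    return "llm"  # fast, works locally or via remote APIs

print(choose_backend(complex_layout=True, has_gpu=True))  # vlm
print(choose_backend())                                   # llm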

Complete Examples

📍 LLM Backend (Local)

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="invoice.pdf",
    template="templates.BillingDocument",

    # LLM backend with Ollama
    backend="llm",
    inference="local",
    provider_override="ollama",
    model_override="llama3.1:8b",

    # Optimized settings
    use_chunking=True,
    processing_mode="many-to-one",

    output_dir="outputs/llm_local"
)

run_pipeline(config)

📍 LLM Backend (Remote)

from docling_graph import run_pipeline, PipelineConfig
import os

# Set API key
os.environ["MISTRAL_API_KEY"] = "your_api_key"

config = PipelineConfig(
    source="contract.pdf",
    template="templates.Contract",

    # LLM backend with Mistral API
    backend="llm",
    inference="remote",
    provider_override="mistral",
    model_override="mistral-large-latest",

    # High accuracy settings
    use_chunking=True,
    llm_consolidation=True,
    processing_mode="many-to-one",

    output_dir="outputs/llm_remote"
)

run_pipeline(config)

📍 VLM Backend

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="complex_form.pdf",
    template="templates.ApplicationForm",

    # VLM backend
    backend="vlm",
    inference="local",
    model_override="numind/NuExtract-2.0-8B",

    # VLM settings
    processing_mode="one-to-one",  # One model per page
    docling_config="vision",       # Vision pipeline
    use_chunking=False,            # VLM doesn't need chunking

    output_dir="outputs/vlm"
)

run_pipeline(config)

📍 Hybrid Approach

from docling_graph import run_pipeline, PipelineConfig

def process_document(doc_path: str, doc_type: str):
    """Process document with appropriate backend."""

    if doc_type == "form":
        # Use VLM for forms
        backend = "vlm"
        inference = "local"
        processing_mode = "one-to-one"
    else:
        # Use LLM for standard docs
        backend = "llm"
        inference = "remote"
        processing_mode = "many-to-one"

    config = PipelineConfig(
        source=doc_path,
        template=f"templates.{doc_type.capitalize()}",
        backend=backend,
        inference=inference,
        processing_mode=processing_mode
    )

    run_pipeline(config)

# Process different document types
process_document("invoice.pdf", "invoice")  # LLM
process_document("form.pdf", "form")        # VLM

Error Handling

LLM Backend Errors

from docling_graph.exceptions import ExtractionError

try:
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        backend="llm",
        inference="remote"
    )
    run_pipeline(config)

except ExtractionError as e:
    print(f"Extraction failed: {e.message}")
    print(f"Details: {e.details}")

    # Fallback to local
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        backend="llm",
        inference="local"
    )
    run_pipeline(config)

VLM Backend Errors

from docling_graph.exceptions import ExtractionError

try:
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        backend="vlm"
    )
    run_pipeline(config)

except ExtractionError as e:
    print(f"VLM extraction failed: {e.message}")

    # Fallback to LLM
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        backend="llm",
        inference="local"
    )
    run_pipeline(config)

Best Practices

👍 Match Backend to Document Type

# ✅ Good - Choose based on the document's characteristics
if document_is_form:
    backend = "vlm"   # complex visual layout
else:
    backend = "llm"   # standard text document

👍 Use Local for Development

# ✅ Good - Fast iteration
config = PipelineConfig(
    source="test.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local"  # Fast for testing
)

👍 Use Remote for Production

# ✅ Good - Reliable and scalable
config = PipelineConfig(
    source="production.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote"  # Reliable
)

👍 Cleanup Resources

# ✅ Good - Always cleanup
from docling_graph.core.extractors.backends import VlmBackend

backend = VlmBackend(model_name="numind/NuExtract-2.0-8B")
try:
    models = backend.extract_from_document(source, template)
finally:
    backend.cleanup()  # Free GPU memory

Enhanced GPU Cleanup

VLM backend now includes enhanced GPU memory management:

  • Model-to-CPU Transfer: Moves model to CPU before deletion
  • CUDA Cache Clearing: Explicitly clears GPU cache
  • Memory Tracking: Logs memory usage before/after cleanup
  • Multi-GPU Support: Handles multiple GPU devices

This ensures GPU memory is properly released, especially important for long-running processes.
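
The pattern behind these steps looks roughly like the sketch below, written with plain PyTorch. The release_gpu_model helper is illustrative; the actual logic lives inside VlmBackend.cleanup():

import gc
import torch

def release_gpu_model(model) -> None:
    """Illustrative cleanup: move to CPU, drop references, clear CUDA caches."""
    before = torch.cuda.memory_allocated() if torch.cuda.is_available() else 0

    model.to("cpu")   # Model-to-CPU Transfer before deletion
    del model         # drop the reference so Python can reclaim it
    gc.collect()

    if torch.cuda.is_available():
        for device in range(torch.cuda.device_count()):  # Multi-GPU Support
            with torch.cuda.device(device):
                torch.cuda.empty_cache()                 # CUDA Cache Clearing
        after = torch.cuda.memory_allocated()
        print(f"GPU memory: {before} -> {after} bytes")  # Memory Tracking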

👍 Use Real Tokenizers

# ✅ Good - Accurate token counting
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    backend="llm",
    inference="local",
    provider_override="ollama",
    model_override="llama3.1:8b",
    use_chunking=True  # Uses real tokenizer with 20% safety margin
)

Benefits:

  • Prevents context window overflows
  • More efficient chunk packing
  • Better resource utilization
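
Under the hood, chunk sizing counts tokens with a real tokenizer and reserves a safety margin. A simplified sketch using Hugging Face transformers (the gpt2 tokenizer stands in for the actual model's tokenizer, and the 20% margin mirrors the description above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer for illustration

def fits_in_context(text: str, context_window: int, margin: float = 0.2) -> bool:
    """Return True if text fits in the context window after the safety margin."""
    budget = int(context_window * (1 - margin))
    return len(tokenizer.encode(text)) <= budget

print(fits_in_context("Invoice Number: INV-001", context_window=8192))  # True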


Troubleshooting

🐛 LLM Returns Empty Results

Solution:

# Check markdown extraction
from docling_graph.core.extractors import DocumentProcessor

processor = DocumentProcessor()
document = processor.convert_to_docling_doc("document.pdf")
markdown = processor.extract_full_markdown(document)

if not markdown.strip():
    print("Markdown extraction failed")

🐛 VLM Out of Memory

Solution:

# Use smaller model
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",
    model_override="numind/NuExtract-2.0-2B"  # Smaller model
)

🐛 Slow VLM Processing

Solution:

# Switch to LLM for speed
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",  # Faster
    inference="local"
)


Advanced Features

Provider-Specific Batching

Different LLM providers have different optimal batching strategies:

from docling_graph import run_pipeline, PipelineConfig

# OpenAI - Uses default 95% threshold
config = PipelineConfig(
    backend="llm",
    inference="remote",
    provider_override="openai",
    model_override="gpt-4-turbo",
    use_chunking=True  # Automatically uses 95% threshold (default)
)

# Anthropic - Uses default 95% threshold
config = PipelineConfig(
    backend="llm",
    inference="remote",
    provider_override="anthropic",
    model_override="claude-3-opus",
    use_chunking=True  # Automatically uses 95% threshold (default)
)

# Ollama - Uses default 95% threshold
config = PipelineConfig(
    backend="llm",
    inference="local",
    provider_override="ollama",
    model_override="llama3.1:8b",
    use_chunking=True  # Automatically uses 95% threshold (default)
)

Why Different Thresholds?

  • OpenAI/Google: Robust to near-limit contexts → aggressive batching
  • Anthropic: More conservative → moderate batching
  • Ollama/Local: Variable performance → conservative batching
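
The effect of a threshold is easy to picture: it caps how much of the context window a batch of chunks may occupy. A simplified greedy-packing sketch (pack_chunks is illustrative, not library API):

def pack_chunks(chunk_token_counts: list[int], context_window: int, threshold: float = 0.95):
    """Group chunks into batches whose total token count stays under the budget."""
    budget = int(context_window * threshold)
    batches, current, used = [], [], 0
    for tokens in chunk_token_counts:
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(tokens)
        used += tokens
    if current:
        batches.append(current)
    return batches

print(pack_chunks([2500, 2500, 2500, 2500], context_window=8192, threshold=0.95))
# [[2500, 2500, 2500], [2500]]  - aggressive batching packs more per call
print(pack_chunks([2500, 2500, 2500, 2500], context_window=8192, threshold=0.80))
# [[2500, 2500], [2500, 2500]]  - conservative batching leaves more headroom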


Next Steps

Now that you understand extraction backends:

  1. Model Capabilities → Learn about adaptive prompting
  2. Model Merging → Learn how to consolidate extractions
  3. Batch Processing → Optimize chunk processing
  4. Performance Tuning → Advanced optimization