
The Extraction Process

Overview

The Extraction Process is the core of Docling Graph, transforming raw documents into structured knowledge graphs through a multi-stage pipeline. This section explains each stage in detail.

What you'll learn:

- How documents are converted to structured format
- Intelligent chunking strategies
- Extraction backends (LLM vs VLM)
- Model merging and consolidation
- Pipeline orchestration


The Four-Stage Pipeline

```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "Input Source" }

    A1@{ shape: tag-proc, label: "Input Normalization" }
    B@{ shape: procs, label: "Conversion" }
    C@{ shape: tag-proc, label: "Chunking" }
    D@{ shape: procs, label: "Extraction" }
    E@{ shape: lin-proc, label: "Merging" }

    F@{ shape: db, label: "Knowledge Graph" }

    %% 3. Define Connections
    A --> A1
    A1 --> B
    B --> C
    C --> D
    D --> E
    E --> F

    %% 4. Apply Classes
    class A input
    class A1,C operator
    class B,D,E process
    class F output
```

Stage 1: Document Conversion

Purpose: Convert PDF/images to structured Docling format

Process:

- OCR or Vision pipeline
- Layout analysis
- Table extraction
- Text extraction

Output: DoclingDocument with structure

Learn more: Document Conversion →


Stage 2: Chunking

Purpose: Split document into optimal chunks for LLM processing

Process:

- Structure-aware splitting
- Token counting
- Semantic boundaries
- Context preservation

Output: List of contextualized chunks

Learn more: Chunking Strategies →


Stage 3: Extraction

Purpose: Extract structured data using LLM/VLM

Process:

- Backend selection (LLM/VLM)
- Batch processing
- Schema validation
- Error handling

Output: List of Pydantic models

Learn more: Extraction Backends →


Stage 4: Merging

Purpose: Consolidate multiple extractions into single model

Process:

- Programmatic merging
- LLM consolidation (optional)
- Conflict resolution
- Validation

Output: Single consolidated model

Learn more: Model Merging →


Processing Modes

Many-to-One (Default)

Best for: Most documents

```python
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="many-to-one"  # Default
)
```

Process:

1. Convert entire document
2. Chunk intelligently
3. Extract from each chunk
4. Merge into single model

Output: 1 consolidated model


One-to-One

Best for: Multi-page forms, page-specific data

```python
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one"
)
```

Process:

1. Convert entire document
2. Extract from each page
3. Return separate models

Output: N models (one per page)
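
To work with the per-page results, run the pipeline through the orchestrator (introduced under Pipeline Stages in Code below). A minimal sketch, assuming `context.extracted_models` preserves page order (that ordering is an assumption, not documented here):

```python
from docling_graph.pipeline.orchestrator import PipelineOrchestrator

# `config` is the one-to-one PipelineConfig defined above.
orchestrator = PipelineOrchestrator(config)
context = orchestrator.run()

# One model per page; page order is assumed, not guaranteed here.
for page_num, model in enumerate(context.extracted_models, start=1):
    print(f"Page {page_num}: {model}")
```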


Backend Comparison

| Feature  | LLM Backend    | VLM Backend      |
|----------|----------------|------------------|
| Input    | Markdown text  | Images/PDFs      |
| Accuracy | High for text  | High for visuals |
| Speed    | Fast           | Slower           |
| Cost     | Low (local)    | Medium           |
| Best For | Text documents | Complex layouts  |
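
In configuration, the choice comes down to the single `backend` field. A minimal sketch (the file names are illustrative):

```python
from docling_graph import PipelineConfig

# Text-heavy document: the LLM backend works on converted Markdown.
text_config = PipelineConfig(
    source="report.pdf",
    template="templates.BillingDocument",
    backend="llm"
)

# Scanned or layout-heavy document: the VLM backend reads page images.
visual_config = PipelineConfig(
    source="scanned_form.pdf",
    template="templates.BillingDocument",
    backend="vlm"
)
```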

Pipeline Stages in Code

Stage Overview

```python
from docling_graph.pipeline.stages import (
    TemplateLoadingStage,    # Load Pydantic template
    ExtractionStage,         # Extract data
    DoclingExportStage,      # Export Docling outputs
    GraphConversionStage,    # Convert to graph
    ExportStage,             # Export graph
    VisualizationStage       # Generate visualizations
)
```

Orchestration

```python
from docling_graph.pipeline.orchestrator import PipelineOrchestrator

orchestrator = PipelineOrchestrator(config)
context = orchestrator.run()

# Access results
print(f"Extracted {len(context.extracted_models)} models")
print(f"Graph has {context.graph_metadata.node_count} nodes")
```

Extraction Flow

Complete Flow Diagram

```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TD

%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

%% 2. Define Nodes
Start@{ shape: terminal, label: "Input Source" }

Normalize@{ shape: procs, label: "Input Normalization" }
CheckInput{"Input Type"}

Convert@{ shape: procs, label: "Document Conversion<br/>PDF/Image" }
TextProc@{ shape: lin-proc, label: "Text Processing<br/>Text/Markdown" }
LoadDoc@{ shape: lin-proc, label: "Load DoclingDocument<br/>Skip to Graph" }

CheckMode{"Processing Mode"}
CheckChunk{"Chunking?"}

PageExtract@{ shape: lin-proc, label: "Page-by-Page Extraction" }
FullDoc@{ shape: lin-proc, label: "Full Document Extraction" }

Chunk@{ shape: tag-proc, label: "Structure-Aware Chunking" }
Batch@{ shape: tag-proc, label: "Batch Chunks" }

Extract@{ shape: procs, label: "Extract from Batches" }

CheckMerge{"Multiple Models?"}

Merge@{ shape: lin-proc, label: "Programmatic Merge" }
Single@{ shape: doc, label: "Single Model" }

CheckConsol{"Consolidation?"}
Consol@{ shape: procs, label: "LLM Consolidation" }

Final@{ shape: doc, label: "Final Model" }
Graph@{ shape: db, label: "Knowledge Graph" }

%% 3. Define Connections
Start --> Normalize
Normalize --> CheckInput

CheckInput -- "PDF/Image" --> Convert
CheckInput -- "Text/Markdown" --> TextProc
CheckInput -- "DoclingDocument" --> LoadDoc

Convert --> CheckMode
TextProc --> CheckMode

LoadDoc --> Graph

CheckMode -- Many-to-One --> CheckChunk
CheckMode -- One-to-One --> PageExtract

CheckChunk -- Yes --> Chunk
CheckChunk -- No --> FullDoc

Chunk --> Batch
Batch --> Extract

FullDoc --> Extract
PageExtract --> Extract

Extract --> CheckMerge
CheckMerge -- Yes --> Merge
CheckMerge -- No --> Single

Merge --> CheckConsol
CheckConsol -- Yes --> Consol
CheckConsol -- No --> Final

Consol --> Final
Single --> Final
Final --> Graph

%% 4. Apply Classes
class Start input
class Normalize,Convert,Extract,Consol process
class TextProc,LoadDoc,PageExtract,FullDoc,Merge process
class Chunk,Batch operator
class CheckInput,CheckMode,CheckChunk,CheckMerge,CheckConsol decision
class Single data
class Final,Graph output

```


Key Concepts

1. Document Conversion

Transform raw documents into structured format:

```python
from docling_graph.core.extractors import DocumentProcessor

processor = DocumentProcessor(docling_config="ocr")
document = processor.convert_to_docling_doc("document.pdf")
```
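
A quick sanity check is to render the converted document back to Markdown with `extract_full_markdown` (the same call used under Troubleshooting below). Continuing the snippet above:

```python
# Render back to Markdown to verify the conversion captured the text.
markdown = processor.extract_full_markdown(document)
print(markdown[:500])  # preview the first 500 characters
```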

Learn more: Document Conversion →


2. Chunking

Split documents intelligently:

```python
from docling_graph.core.extractors import DocumentChunker

chunker = DocumentChunker(
    tokenizer_name="sentence-transformers/all-MiniLM-L6-v2",
    chunk_max_tokens=512
)
chunks = chunker.chunk_document(document)
```
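
Continuing the snippet above, a quick way to confirm the split worked (the exact attributes on each chunk object vary by chunker, so only the count is inspected):

```python
print(f"Produced {len(chunks)} chunks of at most 512 tokens each")
```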

Learn more: Chunking Strategies →


3. Extraction

Extract structured data:

```python
from docling_graph.core.extractors import ExtractorFactory

extractor = ExtractorFactory.create_extractor(
    processing_mode="many-to-one",
    backend_name="llm",
    llm_client=client
)
models, doc = extractor.extract(source, template)
```

Learn more: Extraction Backends →


4. Merging

Consolidate multiple models:

```python
from docling_graph.core.utils import merge_pydantic_models

merged = merge_pydantic_models(models, template)
```
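
As an illustration of what merging does, here is a sketch with two partial extractions. It assumes later non-empty values fill gaps left by earlier models; the actual conflict-resolution rules are described under Model Merging:

```python
from typing import Optional

from pydantic import BaseModel

from docling_graph.core.utils import merge_pydantic_models

class Invoice(BaseModel):
    number: Optional[str] = None
    total: Optional[float] = None

partials = [Invoice(number="INV-7"), Invoice(total=99.0)]
merged = merge_pydantic_models(partials, Invoice)
# Under the gap-filling assumption above: number='INV-7' total=99.0
print(merged)
```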

Learn more: Model Merging →


Performance Optimization

Chunking vs No Chunking

| Approach    | Speed | Accuracy | Memory | Best For   |
|-------------|-------|----------|--------|------------|
| Chunking    | Fast  | High     | Low    | Large docs |
| No Chunking | Slow  | Medium   | High   | Small docs |

Batch Processing

```python
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    use_chunking=True,      # Enable chunking
    max_batch_size=5        # Process 5 chunks at once
)
```

Error Handling

Extraction Errors

```python
from docling_graph.exceptions import ExtractionError

try:
    run_pipeline(config)
except ExtractionError as e:
    print(f"Extraction failed: {e.message}")
    print(f"Details: {e.details}")
```

Pipeline Errors

```python
from docling_graph.exceptions import PipelineError

try:
    run_pipeline(config)
except PipelineError as e:
    print(f"Pipeline failed at stage: {e.details['stage']}")
```

Section Contents

1. Document Conversion

Learn how documents are converted to structured format using Docling pipelines.

Topics:

- OCR vs Vision pipelines
- Layout analysis
- Table extraction
- Multi-language support


2. Chunking Strategies

Understand intelligent document chunking for optimal LLM processing.

Topics:

- Structure-aware chunking
- Token management
- Semantic boundaries
- Provider-specific optimization


3. Extraction Backends

Deep dive into LLM and VLM extraction backends.

Topics:

- LLM backend (text-based)
- VLM backend (vision-based)
- Backend selection
- Performance comparison


4. Model Merging

Learn how multiple extractions are consolidated into single models.

Topics:

- Programmatic merging
- LLM consolidation
- Conflict resolution
- Validation strategies


5. Batch Processing

Optimize extraction with intelligent batching.

Topics:

- Chunk batching
- Context window management
- Adaptive batch sizing
- Performance tuning


6. Pipeline Orchestration

Understand how pipeline stages are coordinated through the extraction process.

Topics:

- Stage execution
- Context management
- Error handling
- Resource cleanup


Quick Examples

📍 Basic Extraction

```python
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local"
)

run_pipeline(config)
```

📍 High-Accuracy Extraction

```python
config = PipelineConfig(
    source="complex_document.pdf",
    template="templates.ScholarlyRheologyPaper",
    backend="vlm",              # Vision backend
    processing_mode="one-to-one",
    docling_config="vision"     # Vision pipeline
)

run_pipeline(config)
```

📍 Optimized for Large Documents

```python
config = PipelineConfig(
    source="large_document.pdf",
    template="templates.Contract",
    backend="llm",
    use_chunking=True,          # Enable chunking
    max_batch_size=3            # Smaller batches
)

run_pipeline(config)
```

Best Practices

👍 Choose the Right Backend

```python
# ✅ Good - Match backend to document type
if document_has_complex_layout:
    backend = "vlm"
else:
    backend = "llm"
```

👍 Enable Chunking for Large Documents

```python
# ✅ Good - Use chunking for efficiency
config = PipelineConfig(
    source="large_doc.pdf",
    template="templates.BillingDocument",
    use_chunking=True  # Recommended
)
```

Troubleshooting

🐛 Extraction Returns Empty Results

Solution:

```python
from docling_graph.core.extractors import DocumentProcessor

# Check document conversion
processor = DocumentProcessor()
document = processor.convert_to_docling_doc("document.pdf")
markdown = processor.extract_full_markdown(document)

if not markdown.strip():
    print("Document conversion failed")
```

🐛 Out of Memory

Solution:

```python
# Enable chunking and reduce batch size
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    use_chunking=True,
    max_batch_size=1  # Smaller batches
)
```

🐛 Slow Extraction

Solution:

```python
# Use local backend for faster inference
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local"
)
```


Next Steps

Ready to dive deeper? Start with:

  1. Document Conversion → - Learn about Docling pipelines
  2. Chunking Strategies → - Optimize document splitting
  3. Extraction Backends → - Choose the right backend