The Extraction Process¶
Overview¶
The Extraction Process is the core of Docling Graph, transforming raw documents into structured knowledge graphs through a multi-stage pipeline. This section explains each stage in detail.
What you'll learn:

- How documents are converted to structured format
- Intelligent chunking strategies
- Extraction backends (LLM vs VLM)
- Model merging and consolidation
- Pipeline orchestration
The Four-Stage Pipeline¶
```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
    %% 2. Define Nodes
    A@{ shape: terminal, label: "Input Source" }
    A1@{ shape: tag-proc, label: "Input Normalization" }
    B@{ shape: procs, label: "Conversion" }
    C@{ shape: tag-proc, label: "Chunking" }
    D@{ shape: procs, label: "Extraction" }
    E@{ shape: lin-proc, label: "Merging" }
    F@{ shape: db, label: "Knowledge Graph" }
    %% 3. Define Connections
    A --> A1
    A1 --> B
    B --> C
    C --> D
    D --> E
    E --> F
    %% 4. Apply Classes
    class A input
    class A1,C operator
    class B,D,E process
    class F output
```
Stage 1: Document Conversion¶
Purpose: Convert PDF/images to structured Docling format
Process:

- OCR or Vision pipeline
- Layout analysis
- Table extraction
- Text extraction
Output: DoclingDocument with structure
Learn more: Document Conversion →
Stage 2: Chunking¶
Purpose: Split document into optimal chunks for LLM processing
Process:

- Structure-aware splitting
- Token counting
- Semantic boundaries
- Context preservation
Output: List of contextualized chunks
Learn more: Chunking Strategies →
Stage 3: Extraction¶
Purpose: Extract structured data using LLM/VLM
Process:

- Backend selection (LLM/VLM)
- Batch processing
- Schema validation
- Error handling
Output: List of Pydantic models
Learn more: Extraction Backends →
Stage 4: Merging¶
Purpose: Consolidate multiple extractions into single model
Process:

- Programmatic merging
- LLM consolidation (optional)
- Conflict resolution
- Validation
Output: Single consolidated model
Learn more: Model Merging →
Processing Modes¶
Many-to-One (Default)¶
Best for: Most documents
```python
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="many-to-one"  # Default
)
```
Process:

1. Convert entire document
2. Chunk intelligently
3. Extract from each chunk
4. Merge into single model
Output: 1 consolidated model
One-to-One¶
Best for: Multi-page forms, page-specific data
```python
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one"
)
```
Process:

1. Convert entire document
2. Extract from each page
3. Return separate models
Output: N models (one per page)
Backend Comparison¶
| Feature | LLM Backend | VLM Backend |
|---|---|---|
| Input | Markdown text | Images/PDFs |
| Accuracy | High for text | High for visuals |
| Speed | Fast | Slower |
| Cost | Low (local) | Medium |
| Best For | Text documents | Complex layouts |
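The trade-offs in the table above can be captured in a small selection heuristic. The helper below is hypothetical (not part of docling_graph) and only sketches one reasonable decision rule:

```python
# Hypothetical helper: pick a backend using the trade-offs from the table.
# Not part of docling_graph - shown only to make the decision rule concrete.
def choose_backend(has_complex_layout: bool, is_scanned_image: bool) -> str:
    """Return "vlm" for visually complex or image-only input, else "llm"."""
    if has_complex_layout or is_scanned_image:
        return "vlm"  # vision backend: stronger on tables, figures, layouts
    return "llm"      # text backend: faster and cheaper on plain text

print(choose_backend(has_complex_layout=False, is_scanned_image=False))  # llm
print(choose_backend(has_complex_layout=True, is_scanned_image=False))   # vlm
```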
Pipeline Stages in Code¶
Stage Overview¶
```python
from docling_graph.pipeline.stages import (
    TemplateLoadingStage,  # Load Pydantic template
    ExtractionStage,       # Extract data
    DoclingExportStage,    # Export Docling outputs
    GraphConversionStage,  # Convert to graph
    ExportStage,           # Export graph
    VisualizationStage     # Generate visualizations
)
```
Orchestration¶
```python
from docling_graph.pipeline.orchestrator import PipelineOrchestrator

orchestrator = PipelineOrchestrator(config)
context = orchestrator.run()

# Access results
print(f"Extracted {len(context.extracted_models)} models")
print(f"Graph has {context.graph_metadata.node_count} nodes")
```
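Conceptually, the orchestrator runs each stage in order against a shared context. The sketch below is illustrative only; `run_stages` and the stub stages are hypothetical stand-ins, and the real `PipelineOrchestrator` adds error handling and resource cleanup around each stage:

```python
# Illustrative sketch of stage orchestration. All names here are hypothetical;
# the real PipelineOrchestrator wraps each stage with error handling/cleanup.
from typing import Callable

Stage = Callable[[dict], None]

def run_stages(stages: list[Stage], context: dict) -> dict:
    for stage in stages:
        stage(context)  # each stage reads and mutates the shared context
    return context

def load_template(ctx: dict) -> None:
    ctx["template"] = "BillingDocument"

def extract(ctx: dict) -> None:
    ctx["models"] = [f"model from {ctx['template']}"]

result = run_stages([load_template, extract], {})
print(result["models"])  # ['model from BillingDocument']
```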
Extraction Flow¶
Complete Flow Diagram¶
```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TD
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
Start@{ shape: terminal, label: "Input Source" }
Normalize@{ shape: procs, label: "Input Normalization" }
CheckInput{"Input Type"}
Convert@{ shape: procs, label: "Document Conversion<br/>PDF/Image" }
TextProc@{ shape: lin-proc, label: "Text Processing<br/>Text/Markdown" }
LoadDoc@{ shape: lin-proc, label: "Load DoclingDocument<br/>Skip to Graph" }
CheckMode{"Process. Mode"}
CheckChunk{"Chunking?"}
PageExtract@{ shape: lin-proc, label: "Page-by-Page Extraction" }
FullDoc@{ shape: lin-proc, label: "Full Document Extraction" }
Chunk@{ shape: tag-proc, label: "Structure-Aware Chunking" }
Batch@{ shape: tag-proc, label: "Batch Chunks" }
Extract@{ shape: procs, label: "Extract from Batches" }
CheckMerge{"Multiple Models?"}
Merge@{ shape: lin-proc, label: "Programmatic Merge" }
Single@{ shape: doc, label: "Single Model" }
CheckConsol{"Consolidation?"}
Consol@{ shape: procs, label: "LLM Consolidation" }
Final@{ shape: doc, label: "Final Model" }
Graph@{ shape: db, label: "Knowledge Graph" }
%% 3. Define Connections
Start --> Normalize
Normalize --> CheckInput
CheckInput -- "PDF/Image" --> Convert
CheckInput -- "Text/Markdown" --> TextProc
CheckInput -- "DoclingDocument" --> LoadDoc
Convert --> CheckMode
TextProc --> CheckMode
LoadDoc --> Graph
CheckMode -- Many-to-One --> CheckChunk
CheckMode -- One-to-One --> PageExtract
CheckChunk -- Yes --> Chunk
CheckChunk -- No --> FullDoc
Chunk --> Batch
Batch --> Extract
FullDoc --> Extract
PageExtract --> Extract
Extract --> CheckMerge
CheckMerge -- Yes --> Merge
CheckMerge -- No --> Single
Merge --> CheckConsol
CheckConsol -- Yes --> Consol
CheckConsol -- No --> Final
Consol --> Final
Single --> Final
Final --> Graph
%% 4. Apply Classes
class Start input
class Normalize,Convert,Extract,Consol process
class TextProc,LoadDoc,PageExtract,FullDoc,Merge process
class Chunk,Batch operator
class CheckInput,CheckMode,CheckChunk,CheckMerge,CheckConsol decision
class Single data
class Final,Graph output
```
Key Concepts¶
1. Document Conversion¶
Transform raw documents into structured format:
```python
from docling_graph.core.extractors import DocumentProcessor

processor = DocumentProcessor(docling_config="ocr")
document = processor.convert_to_docling_doc("document.pdf")
```
Learn more: Document Conversion →
2. Chunking¶
Split documents intelligently:
```python
from docling_graph.core.extractors import DocumentChunker

chunker = DocumentChunker(
    tokenizer_name="sentence-transformers/all-MiniLM-L6-v2",
    chunk_max_tokens=512
)
chunks = chunker.chunk_document(document)
```
Learn more: Chunking Strategies →
3. Extraction¶
Extract structured data:
```python
from docling_graph.core.extractors import ExtractorFactory

extractor = ExtractorFactory.create_extractor(
    processing_mode="many-to-one",
    backend_name="llm",
    llm_client=client
)
models, doc = extractor.extract(source, template)
```
Learn more: Extraction Backends →
4. Merging¶
Consolidate multiple models:
```python
from docling_graph.core.utils import merge_pydantic_models

merged = merge_pydantic_models(models, template)
```
Learn more: Model Merging →
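To make the idea of programmatic merging concrete, the sketch below shows a last-non-empty-wins field merge over plain dicts. This is a conceptual illustration only; the real `merge_pydantic_models` operates on Pydantic models and its exact conflict-resolution rules may differ:

```python
# Conceptual sketch of programmatic merging (not docling_graph's actual code):
# later non-empty values overwrite earlier ones, empties never clobber data.
def merge_partials(partials: list[dict]) -> dict:
    merged: dict = {}
    for partial in partials:
        for key, value in partial.items():
            if value not in (None, "", []):  # skip empty extractions
                merged[key] = value
    return merged

chunk_results = [
    {"invoice_id": "INV-42", "total": None},   # chunk 1 missed the total
    {"invoice_id": "INV-42", "total": 99.5},   # chunk 2 found it
]
print(merge_partials(chunk_results))  # {'invoice_id': 'INV-42', 'total': 99.5}
```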
Performance Optimization¶
Chunking vs No Chunking¶
| Approach | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|
| Chunking | Fast | High | Low | Large docs |
| No Chunking | Slow | Medium | High | Small docs |
Batch Processing¶
```python
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    use_chunking=True,  # Enable chunking
    max_batch_size=5    # Process 5 chunks at once
)
```
Error Handling¶
Extraction Errors¶
```python
from docling_graph import run_pipeline
from docling_graph.exceptions import ExtractionError

try:
    run_pipeline(config)
except ExtractionError as e:
    print(f"Extraction failed: {e.message}")
    print(f"Details: {e.details}")
```
Pipeline Errors¶
```python
from docling_graph import run_pipeline
from docling_graph.exceptions import PipelineError

try:
    run_pipeline(config)
except PipelineError as e:
    print(f"Pipeline failed at stage: {e.details['stage']}")
```
Section Contents¶
1. Document Conversion¶
Learn how documents are converted to structured format using Docling pipelines.
Topics:

- OCR vs Vision pipelines
- Layout analysis
- Table extraction
- Multi-language support
2. Chunking Strategies¶
Understand intelligent document chunking for optimal LLM processing.
Topics:

- Structure-aware chunking
- Token management
- Semantic boundaries
- Provider-specific optimization
3. Extraction Backends¶
Deep dive into LLM and VLM extraction backends.
Topics:

- LLM backend (text-based)
- VLM backend (vision-based)
- Backend selection
- Performance comparison
4. Model Merging¶
Learn how multiple extractions are consolidated into single models.
Topics:

- Programmatic merging
- LLM consolidation
- Conflict resolution
- Validation strategies
5. Batch Processing¶
Optimize extraction with intelligent batching.
Topics:

- Chunk batching
- Context window management
- Adaptive batch sizing
- Performance tuning
6. Pipeline Orchestration¶
Understand how pipeline stages are coordinated through the extraction process.
Topics:

- Stage execution
- Context management
- Error handling
- Resource cleanup
Quick Examples¶
📍 Basic Extraction¶
```python
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local"
)
run_pipeline(config)
```
📍 High-Accuracy Extraction¶
```python
config = PipelineConfig(
    source="complex_document.pdf",
    template="templates.ScholarlyRheologyPaper",
    backend="vlm",                 # Vision backend
    processing_mode="one-to-one",
    docling_config="vision"        # Vision pipeline
)
run_pipeline(config)
```
📍 Optimized for Large Documents¶
```python
config = PipelineConfig(
    source="large_document.pdf",
    template="templates.Contract",
    backend="llm",
    use_chunking=True,  # Enable chunking
    max_batch_size=3    # Smaller batches
)
run_pipeline(config)
```
Best Practices¶
👍 Choose the Right Backend¶
```python
# ✅ Good - Match backend to document type
if document_has_complex_layout:
    backend = "vlm"
else:
    backend = "llm"
```
👍 Enable Chunking for Large Documents¶
```python
# ✅ Good - Use chunking for efficiency
config = PipelineConfig(
    source="large_doc.pdf",
    template="templates.BillingDocument",
    use_chunking=True  # Recommended
)
```
Troubleshooting¶
🐛 Extraction Returns Empty Results¶
Solution:
```python
# Check document conversion
from docling_graph.core.extractors import DocumentProcessor

processor = DocumentProcessor()
document = processor.convert_to_docling_doc("document.pdf")
markdown = processor.extract_full_markdown(document)
if not markdown.strip():
    print("Document conversion failed")
```
🐛 Out of Memory¶
Solution:
```python
# Enable chunking and reduce batch size
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    use_chunking=True,
    max_batch_size=1  # Smaller batches
)
```
🐛 Slow Extraction¶
Solution:
```python
# Use local backend for faster inference
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local"
)
```
Next Steps¶
Ready to dive deeper? Start with:
- Document Conversion → - Learn about Docling pipelines
- Chunking Strategies → - Optimize document splitting
- Extraction Backends → - Choose the right backend