The Extraction Process¶
Overview¶
The Extraction Process is the core of Docling Graph, transforming raw documents into structured knowledge graphs through a multi-stage pipeline. This section explains each stage in detail.
What you'll learn:

- How documents are converted to structured format
- Intelligent chunking strategies
- Extraction backends (LLM vs VLM)
- Model merging and consolidation
- Pipeline orchestration
The Four-Stage Pipeline¶
```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
    %% 2. Define Nodes
    A@{ shape: terminal, label: "Input Source" }
    A1@{ shape: tag-proc, label: "Input Normalization" }
    B@{ shape: procs, label: "Conversion" }
    C@{ shape: tag-proc, label: "Chunking" }
    D@{ shape: procs, label: "Extraction" }
    E@{ shape: lin-proc, label: "Merging" }
    F@{ shape: db, label: "Knowledge Graph" }
    %% 3. Define Connections
    A --> A1
    A1 --> B
    B --> C
    C --> D
    D --> E
    E --> F
    %% 4. Apply Classes
    class A input
    class A1,C operator
    class B,D,E process
    class F output
```
Stage 1: Document Conversion¶
Purpose: Convert PDF/images to structured Docling format
Process:

- OCR or Vision pipeline
- Layout analysis
- Table extraction
- Text extraction
Output: DoclingDocument with structure
Learn more: Document Conversion →
Stage 2: Chunking¶
Purpose: Split document into optimal chunks for LLM processing
Process:

- Structure-aware splitting
- Token counting
- Semantic boundaries
- Context preservation
Output: List of contextualized chunks
Learn more: Chunking Strategies →
Stage 3: Extraction¶
Purpose: Extract structured data using LLM/VLM
Process:

- Backend selection (LLM/VLM)
- Batch processing
- Schema validation
- Error handling
Output: List of Pydantic models
Learn more: Extraction Backends →
Stage 4: Merging¶
Purpose: Consolidate multiple extractions into single model
Process:

- Programmatic merging
- LLM consolidation (optional)
- Conflict resolution
- Validation
Output: Single consolidated model
Learn more: Model Merging →
Processing Modes¶
Many-to-One (Default)¶
Best for: Most documents
```python
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="many-to-one"  # Default
)
```
Process:

1. Convert entire document
2. Chunk intelligently
3. Extract from each chunk
4. Merge into single model
Output: 1 consolidated model
One-to-One¶
Best for: Multi-page forms, page-specific data
```python
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one"
)
```
Process:

1. Convert entire document
2. Extract from each page
3. Return separate models
Output: N models (one per page)
Backend Comparison¶
| Feature | LLM Backend | VLM Backend |
|---|---|---|
| Input | Markdown text | Images/PDFs |
| Accuracy | High for text | High for visuals |
| Speed | Fast | Slower |
| Cost | Low (local) | Medium |
| Best For | Text documents | Complex layouts |
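The trade-offs in the table above can be captured in a small selection heuristic. The helper below is hypothetical (not part of docling_graph) and only sketches one reasonable decision rule:

```python
# Hypothetical helper: pick a backend using the trade-offs from the table.
# Not part of docling_graph - shown only to make the decision rule concrete.
def choose_backend(has_complex_layout: bool, is_scanned_image: bool) -> str:
    """Return "vlm" for visually complex or image-only input, else "llm"."""
    if has_complex_layout or is_scanned_image:
        return "vlm"  # vision backend: stronger on tables, figures, layouts
    return "llm"      # text backend: faster and cheaper on plain text

print(choose_backend(has_complex_layout=False, is_scanned_image=False))  # llm
print(choose_backend(has_complex_layout=True, is_scanned_image=False))   # vlm
```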
Pipeline Stages in Code¶
Stage Overview¶
```python
from docling_graph.pipeline.stages import (
    TemplateLoadingStage,  # Load Pydantic template
    ExtractionStage,       # Extract data
    DoclingExportStage,    # Export Docling outputs
    GraphConversionStage,  # Convert to graph
    ExportStage,           # Export graph
    VisualizationStage     # Generate visualizations
)
```
Orchestration¶
```python
from docling_graph.pipeline.orchestrator import PipelineOrchestrator

orchestrator = PipelineOrchestrator(config)
context = orchestrator.run()

# Access results
print(f"Extracted {len(context.extracted_models)} models")
print(f"Graph has {context.graph_metadata.node_count} nodes")
```
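Conceptually, the orchestrator runs each stage in order against a shared context. The sketch below is illustrative only; `run_stages` and the stub stages are hypothetical stand-ins, and the real `PipelineOrchestrator` adds error handling and resource cleanup around each stage:

```python
# Illustrative sketch of stage orchestration. All names here are hypothetical;
# the real PipelineOrchestrator wraps each stage with error handling/cleanup.
from typing import Callable

Stage = Callable[[dict], None]

def run_stages(stages: list[Stage], context: dict) -> dict:
    for stage in stages:
        stage(context)  # each stage reads and mutates the shared context
    return context

def load_template(ctx: dict) -> None:
    ctx["template"] = "BillingDocument"

def extract(ctx: dict) -> None:
    ctx["models"] = [f"model from {ctx['template']}"]

result = run_stages([load_template, extract], {})
print(result["models"])  # ['model from BillingDocument']
```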
Extraction Flow¶
Complete Flow Diagram¶
```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TD
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
Start@{ shape: terminal, label: "Input Source" }
Normalize@{ shape: procs, label: "Input Normalization" }
CheckInput{"Input Type"}
Convert@{ shape: procs, label: "Document Conversion<br/>PDF/Image" }
TextProc@{ shape: lin-proc, label: "Text Processing<br/>Text/Markdown" }
LoadDoc@{ shape: lin-proc, label: "Load DoclingDocument<br/>Skip to Graph" }
CheckMode{"Process. Mode"}
CheckChunk{"Chunking?"}
PageExtract@{ shape: lin-proc, label: "Page-by-Page Extraction" }
FullDoc@{ shape: lin-proc, label: "Full Document Extraction" }
Chunk@{ shape: tag-proc, label: "Structure-Aware Chunking" }
Batch@{ shape: tag-proc, label: "Batch Chunks" }
Extract@{ shape: procs, label: "Extract from Batches" }
CheckMerge{"Multiple Models?"}
Merge@{ shape: lin-proc, label: "Programmatic Merge" }
Single@{ shape: doc, label: "Single Model" }
CheckConsol{"Consolidation?"}
Consol@{ shape: procs, label: "LLM Consolidation" }
Final@{ shape: doc, label: "Final Model" }
Graph@{ shape: db, label: "Knowledge Graph" }
%% 3. Define Connections
Start --> Normalize
Normalize --> CheckInput
CheckInput -- "PDF/Image" --> Convert
CheckInput -- "Text/Markdown" --> TextProc
CheckInput -- "DoclingDocument" --> LoadDoc
Convert --> CheckMode
TextProc --> CheckMode
LoadDoc --> Graph
CheckMode -- Many-to-One --> CheckChunk
CheckMode -- One-to-One --> PageExtract
CheckChunk -- Yes --> Chunk
CheckChunk -- No --> FullDoc
Chunk --> Batch
Batch --> Extract
FullDoc --> Extract
PageExtract --> Extract
Extract --> CheckMerge
CheckMerge -- Yes --> Merge
CheckMerge -- No --> Single
Merge --> CheckConsol
CheckConsol -- Yes --> Consol
CheckConsol -- No --> Final
Consol --> Final
Single --> Final
Final --> Graph
%% 4. Apply Classes
class Start input
class Normalize,Convert,Extract,Consol process
class TextProc,LoadDoc,PageExtract,FullDoc,Merge process
class Chunk,Batch operator
class CheckInput,CheckMode,CheckChunk,CheckMerge,CheckConsol decision
class Single data
class Final,Graph output
```
Key Concepts¶
1. Document Conversion¶
Transform raw documents into structured format:
```python
from docling_graph.core.extractors import DocumentProcessor

processor = DocumentProcessor(docling_config="ocr")
document = processor.convert_to_docling_doc("document.pdf")
```
Learn more: Document Conversion →
2. Chunking¶
Split documents intelligently:
```python
from docling_graph.core.extractors import DocumentChunker

chunker = DocumentChunker(
    tokenizer_name="sentence-transformers/all-MiniLM-L6-v2",
    chunk_max_tokens=512
)
chunks = chunker.chunk_document(document)
```
Learn more: Chunking Strategies →
3. Extraction¶
Extract structured data:
```python
from docling_graph.core.extractors import ExtractorFactory

extractor = ExtractorFactory.create_extractor(
    processing_mode="many-to-one",
    backend_name="llm",
    llm_client=client
)
models, doc = extractor.extract(source, template)
```
Learn more: Extraction Backends →
4. Merging¶
Consolidate multiple models:
```python
from docling_graph.core.utils import merge_pydantic_models

merged = merge_pydantic_models(models, template)
```
Learn more: Model Merging →
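To make the idea of programmatic merging concrete, the sketch below shows a last-non-empty-wins field merge over plain dicts. This is a conceptual illustration only; the real `merge_pydantic_models` operates on Pydantic models and its exact conflict-resolution rules may differ:

```python
# Conceptual sketch of programmatic merging (not docling_graph's actual code):
# later non-empty values overwrite earlier ones, empties never clobber data.
def merge_partials(partials: list[dict]) -> dict:
    merged: dict = {}
    for partial in partials:
        for key, value in partial.items():
            if value not in (None, "", []):  # skip empty extractions
                merged[key] = value
    return merged

chunk_results = [
    {"invoice_id": "INV-42", "total": None},   # chunk 1 missed the total
    {"invoice_id": "INV-42", "total": 99.5},   # chunk 2 found it
]
print(merge_partials(chunk_results))  # {'invoice_id': 'INV-42', 'total': 99.5}
```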
Performance Optimization¶
Chunking vs No Chunking¶
| Approach | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|
| Chunking | Fast | High | Low | Large docs |
| No Chunking | Slow | Medium | High | Small docs |
Batch Processing¶
```python
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    use_chunking=True,  # Enable chunking
    max_batch_size=5    # Process 5 chunks at once
)
```
Error Handling¶
Extraction Errors¶
```python
from docling_graph import run_pipeline
from docling_graph.exceptions import ExtractionError

try:
    run_pipeline(config)
except ExtractionError as e:
    print(f"Extraction failed: {e.message}")
    print(f"Details: {e.details}")
```
Pipeline Errors¶
```python
from docling_graph import run_pipeline
from docling_graph.exceptions import PipelineError

try:
    run_pipeline(config)
except PipelineError as e:
    print(f"Pipeline failed at stage: {e.details['stage']}")
```
Section Contents¶
1. Document Conversion¶
Learn how documents are converted to structured format using Docling pipelines.
Topics:

- OCR vs Vision pipelines
- Layout analysis
- Table extraction
- Multi-language support
2. Chunking Strategies¶
Understand intelligent document chunking for optimal LLM processing.
Topics:

- Structure-aware chunking
- Token management
- Semantic boundaries
- Provider-specific optimization
3. Extraction Backends¶
Deep dive into LLM and VLM extraction backends.
Topics:

- LLM backend (text-based)
- VLM backend (vision-based)
- Backend selection
- Performance comparison
4. Model Merging¶
Learn how multiple extractions are consolidated into single models.
Topics:

- Programmatic merging
- LLM consolidation
- Conflict resolution
- Validation strategies
5. Batch Processing¶
Optimize extraction with intelligent batching.
Topics:

- Chunk batching
- Context window management
- Adaptive batch sizing
- Performance tuning
6. Pipeline Orchestration¶
Understand how pipeline stages are coordinated through the extraction process.
Topics:

- Stage execution
- Context management
- Error handling
- Resource cleanup
Quick Examples¶
📍 Basic Extraction¶
```python
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local"
)
run_pipeline(config)
```
📍 High-Accuracy Extraction¶
```python
config = PipelineConfig(
    source="complex_document.pdf",
    template="templates.ScholarlyRheologyPaper",
    backend="vlm",                 # Vision backend
    processing_mode="one-to-one",
    docling_config="vision"        # Vision pipeline
)
run_pipeline(config)
```
📍 Optimized for Large Documents¶
```python
config = PipelineConfig(
    source="large_document.pdf",
    template="templates.Contract",
    backend="llm",
    use_chunking=True,  # Enable chunking
    max_batch_size=3    # Smaller batches
)
run_pipeline(config)
```
Best Practices¶
👍 Choose the Right Backend¶
```python
# ✅ Good - Match backend to document type
if document_has_complex_layout:
    backend = "vlm"
else:
    backend = "llm"
```
👍 Enable Chunking for Large Documents¶
```python
# ✅ Good - Use chunking for efficiency
config = PipelineConfig(
    source="large_doc.pdf",
    template="templates.BillingDocument",
    use_chunking=True  # Recommended
)
```
Troubleshooting¶
🐛 Extraction Returns Empty Results¶
Solution:
```python
# Check document conversion
from docling_graph.core.extractors import DocumentProcessor

processor = DocumentProcessor()
document = processor.convert_to_docling_doc("document.pdf")
markdown = processor.extract_full_markdown(document)
if not markdown.strip():
    print("Document conversion failed")
```
🐛 Out of Memory¶
Solution:
```python
# Enable chunking and reduce batch size
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    use_chunking=True,
    max_batch_size=1  # Smaller batches
)
```
🐛 Slow Extraction¶
Solution:
```python
# Use local backend for faster inference
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local"
)
```
Next Steps¶
Ready to dive deeper? Start with:
- Document Conversion → - Learn about Docling pipelines
- Chunking Strategies → - Optimize document splitting
- Extraction Backends → - Choose the right backend