Advanced Topics¶
Overview¶
This section covers advanced topics for extending and optimizing docling-graph. These guides are for users who need to:
- Create custom extraction backends
- Build custom exporters
- Add pipeline stages
- Optimize performance
- Handle errors gracefully
- Test templates and pipelines
Topics¶
🧩 Extensibility¶
Custom Backends
Create custom extraction backends for specialized models or APIs.
- Implement backend protocols
- VLM backend example
- LLM backend example
- Integration with pipeline
Custom Exporters
Build custom exporters for specialized output formats.
- Implement exporter protocol
- Graph data access
- Custom format generation
- Registration and usage
Custom Stages
Add custom stages to the pipeline for specialized processing.
- Pipeline stage protocol
- Stage implementation
- Context management
- Error handling
📐 Optimization¶
Performance Tuning
Optimize extraction speed and resource usage.
- Model selection strategies
- Batch size optimization
- Memory management
- GPU utilization
- Caching strategies
🛡️ Reliability¶
Error Handling
Handle errors gracefully and implement retry logic.
- Exception hierarchy
- Error recovery strategies
- Logging and debugging
- Retry mechanisms
Testing
Test templates, backends, and pipelines.
- Template validation
- Mock backends
- Integration testing
- CI/CD integration
Prerequisites¶
Before diving into advanced topics, ensure you understand:
- Schema Definition - Pydantic templates
- Pipeline Configuration - Configuration options
- Extraction Process - How extraction works
- Python API - Programmatic usage
When to Use Advanced Features¶
Custom Backends¶
Use when:
✅ You have a specialized model not supported by default
✅ You need to integrate with a proprietary API
✅ You want to implement custom preprocessing
✅ You need fine-grained control over extraction
Don't use when:
❌ Default backends meet your needs
❌ You're just starting with docling-graph
❌ You don't need custom logic
Custom Exporters¶
Use when:
✅ You need a specialized output format
✅ You're integrating with a specific database
✅ You need custom data transformations
✅ Default formats don't meet requirements
Don't use when:
❌ CSV, Cypher, or JSON formats work
❌ You can post-process existing exports
❌ You're prototyping
Custom Stages¶
Use when:
✅ You need custom preprocessing
✅ You want to add validation steps
✅ You need custom post-processing
✅ You're building a specialized pipeline
Don't use when:
❌ Default pipeline stages suffice
❌ You can achieve goals with configuration
❌ You're learning the system
Architecture¶
Extension Points¶
```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TB
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "Input Source" }
    B@{ shape: lin-proc, label: "Custom Stage 1" }
    C@{ shape: procs, label: "Docling Conversion" }
    D@{ shape: tag-proc, label: "Custom Backend" }
    E@{ shape: procs, label: "Extraction" }
    F@{ shape: lin-proc, label: "Custom Stage 2" }
    G@{ shape: procs, label: "Graph Conversion" }
    H@{ shape: tag-proc, label: "Custom Exporter" }
    I@{ shape: doc, label: "Output" }

    %% 3. Define Connections
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I

    %% 4. Apply Classes
    class A input
    class B,F config
    class C,E,G process
    class D,H operator
    class I output
```

Extension Points:

- Custom Backends (purple): Replace extraction logic
- Custom Exporters (purple): Replace export logic
- Custom Stages (yellow): Add processing steps
Code Organization¶
Project Structure for Extensions¶
```
my_project/
├── templates/           # Pydantic templates
│   └── my_template.py
├── backends/            # Custom backends
│   ├── __init__.py
│   └── my_backend.py
├── exporters/           # Custom exporters
│   ├── __init__.py
│   └── my_exporter.py
├── stages/              # Custom stages
│   ├── __init__.py
│   └── my_stage.py
├── tests/               # Tests
│   ├── test_backend.py
│   ├── test_exporter.py
│   └── test_stage.py
└── main.py              # Entry point
```
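To see how these pieces meet, here is a hypothetical `main.py` for the layout above. The import paths simply mirror the tree; how the custom backend and exporter are actually registered with the pipeline is covered in the Custom Backends and Custom Exporters guides, and the Integrate step below shows the same wiring in context.

```python
# main.py - hypothetical entry point for the layout above
from docling_graph import PipelineConfig

from backends.my_backend import MyBackend     # custom backend (see Custom Backends)
from exporters.my_exporter import MyExporter  # custom exporter (see Custom Exporters)

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    # Backend and exporter registration is covered in the guides above.
)
```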
Development Workflow¶
1. Design¶
```python
# Define the interface
from docling_graph.protocols import TextExtractionBackendProtocol

class MyBackend(TextExtractionBackendProtocol):
    """Custom backend implementation."""
    pass
```
2. Implement¶
```python
# Implement the protocol methods
def extract_from_markdown(self, markdown: str, template, context="", is_partial=False):
    """Extract structured data."""
    # Your logic here
    pass
```
3. Test¶
```python
# Write tests
def test_my_backend():
    backend = MyBackend()
    result = backend.extract_from_markdown("test", MyTemplate)
    assert result is not None
```
4. Integrate¶
```python
# Use in pipeline
from docling_graph import PipelineConfig
from backends.my_backend import MyBackend

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    # Custom backend integration
)
```
Best Practices¶
👍 Follow Protocols¶
```python
# ✅ Good - Implement the protocol
from docling_graph.protocols import TextExtractionBackendProtocol

class MyBackend(TextExtractionBackendProtocol):
    def extract_from_markdown(self, ...): ...
    def consolidate_from_pydantic_models(self, ...): ...
    def cleanup(self): ...

# ❌ Avoid - Custom interface the pipeline cannot call
class MyBackend:
    def my_custom_method(self, ...): ...
```
👍 Handle Errors¶
```python
# ✅ Good - Use docling-graph exceptions with context
from docling_graph.exceptions import ExtractionError

def extract(self, source, ...):
    try:
        result = self._process()
        return result
    except Exception as e:
        raise ExtractionError(
            "Extraction failed",
            details={"source": source},
            cause=e,
        )

# ❌ Avoid - Generic exceptions with no context
def extract(self, ...):
    raise Exception("Something went wrong")
```
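The Error Handling guide also covers retry mechanisms for transient failures such as rate limits or network timeouts. Here is a minimal sketch of the idea; the attempt count, backoff values, and the `backend` argument are illustrative, not part of the docling-graph API:

```python
import time

from docling_graph.exceptions import ExtractionError

def extract_with_retry(backend, markdown, template, retries=3, backoff=2.0):
    """Retry a failed extraction with exponential backoff (illustrative values)."""
    for attempt in range(1, retries + 1):
        try:
            return backend.extract_from_markdown(markdown, template)
        except ExtractionError:
            if attempt == retries:
                raise  # Out of attempts: propagate the last error
            time.sleep(backoff ** attempt)  # 2s, 4s, 8s, ...
```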
👍 Write Tests¶
```python
# ✅ Good - Cover success, failure, and cleanup paths
def test_backend_success():
    """Test successful extraction."""
    pass

def test_backend_failure():
    """Test error handling."""
    pass

def test_backend_cleanup():
    """Test resource cleanup."""
    pass

# ❌ Avoid - No tests
# (No tests written)
```
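The Testing guide covers mock backends, which keep tests like these fast and offline. A sketch of the idea follows; the `Invoice` template and the `consolidate_from_pydantic_models` signature are assumptions for illustration, and only the protocol method names come from `TextExtractionBackendProtocol`:

```python
from pydantic import BaseModel

class Invoice(BaseModel):
    """Hypothetical template used only for this test."""
    total: float

class MockBackend:
    """Implements the protocol methods but returns canned data offline."""

    def extract_from_markdown(self, markdown, template, context="", is_partial=False):
        return template(total=42.0)

    def consolidate_from_pydantic_models(self, models):  # signature assumed
        return models[0] if models else None

    def cleanup(self):
        pass

def test_extraction_returns_template_instance():
    backend = MockBackend()
    result = backend.extract_from_markdown("# Invoice\nTotal: 42.0", Invoice)
    assert isinstance(result, Invoice)
    assert result.total == 42.0
```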
👍 Document Code¶
```python
# ✅ Good - Clear documentation
class MyBackend:
    """Custom backend for specialized extraction.

    This backend uses a proprietary model to extract
    structured data from documents.

    Args:
        api_key: API key for the service.
        model: Model name to use.

    Example:
        >>> backend = MyBackend(api_key="key", model="model-v1")
        >>> result = backend.extract_from_markdown(text, Template)
    """
    pass

# ❌ Avoid - No documentation
class MyBackend:
    pass
```
Performance Considerations¶
Memory Management¶
```python
# ✅ Good - Release resources in cleanup()
class MyBackend:
    def cleanup(self):
        """Release resources."""
        if hasattr(self, 'model'):
            del self.model
        if hasattr(self, 'client'):
            self.client.close()

# ❌ Avoid - Memory leaks
class MyBackend:
    def cleanup(self):
        pass  # Resources not released
```
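A correct `cleanup()` only helps if it is guaranteed to run. One way to enforce that is a small context manager; this is plain standard library, not a docling-graph API:

```python
from contextlib import contextmanager

@contextmanager
def managed_backend(backend):
    """Yield the backend and guarantee cleanup() runs, even on errors."""
    try:
        yield backend
    finally:
        backend.cleanup()

# Usage:
# with managed_backend(MyBackend()) as backend:
#     backend.extract_from_markdown(markdown, Template)
```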
Batch Processing¶
```python
# ✅ Good - Process in fixed-size batches
def process_documents(docs):
    batch_size = 10
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        process_batch(batch)

# ❌ Avoid - Process all at once
def process_documents(docs):
    process_all(docs)  # May run out of memory
```
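The slicing idiom above assumes `docs` is a list. For lazy iterables (for example, documents streamed from disk) a generator-based batcher avoids materializing everything up front. This is a generic standard-library sketch, and `process_batch` is hypothetical:

```python
from itertools import islice

def batched(iterable, batch_size=10):
    """Yield successive lists of up to batch_size items from any iterable."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# for batch in batched(document_stream, batch_size=10):
#     process_batch(batch)
```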
Security Considerations¶
API Keys¶
```python
# ✅ Good - Use environment variables
import os

api_key = os.getenv("MY_API_KEY")
if not api_key:
    raise ValueError("MY_API_KEY not set")

# ❌ Avoid - Hardcoded keys
api_key = "sk-1234567890"  # Never do this!
```
Input Validation¶
```python
# ✅ Good - Validate inputs before processing
def extract(self, markdown: str, template):
    if not markdown:
        raise ValueError("Markdown cannot be empty")
    if not template:
        raise ValueError("Template is required")
    # Process...

# ❌ Avoid - No validation
def extract(self, markdown, template):
    # Process without checks
    pass
```
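Because templates are Pydantic models (see Schema Definition), the template check can be stricter than a truthiness test. A small sketch; docling-graph does not require this exact check:

```python
from pydantic import BaseModel

def validate_template(template):
    """Reject anything that is not a Pydantic model class."""
    if not (isinstance(template, type) and issubclass(template, BaseModel)):
        raise TypeError(f"template must be a Pydantic model class, got {template!r}")
```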
Next Steps¶
Choose a topic based on your needs:
- Custom Backends → Extend extraction capabilities
- Custom Exporters → Create custom output formats
- Custom Stages → Add pipeline stages
- Performance Tuning → Optimize performance
- Error Handling → Handle errors gracefully
- Testing → Test your extensions