Advanced Topics¶
Overview¶
This section covers advanced topics for extending and optimizing docling-graph. These guides are for users who need to:
- Create custom extraction backends
- Build custom exporters
- Add pipeline stages
- Optimize performance
- Handle errors gracefully
- Test templates and pipelines
Topics¶
🧩 Extensibility¶
Custom Backends
Create custom extraction backends for specialized models or APIs.
- Implement backend protocols
- VLM backend example
- LLM backend example
- Integration with pipeline
Custom Exporters
Build custom exporters for specialized output formats.
- Implement exporter protocol
- Graph data access
- Custom format generation
- Registration and usage
Custom Stages
Add custom stages to the pipeline for specialized processing.
- Pipeline stage protocol
- Stage implementation
- Context management
- Error handling
📐 Optimization¶
Performance Tuning
Optimize extraction speed and resource usage.
- Model selection strategies
- Batch size optimization
- Memory management
- GPU utilization
- Caching strategies
🛡️ Reliability¶
Error Handling
Handle errors gracefully and implement retry logic.
- Exception hierarchy
- Error recovery strategies
- Logging and debugging
- Retry mechanisms
Testing
Test templates, backends, and pipelines.
- Template validation
- Mock backends
- Integration testing
- CI/CD integration
Prerequisites¶
Before diving into advanced topics, ensure you understand:
- Schema Definition - Pydantic templates
- Pipeline Configuration - Configuration options
- Extraction Process - How extraction works
- Python API - Programmatic usage
When to Use Advanced Features¶
Custom Backends¶
Use when:
✅ You have a specialized model not supported by default
✅ You need to integrate with a proprietary API
✅ You want to implement custom preprocessing
✅ You need fine-grained control over extraction
Don't use when:
❌ Default backends meet your needs
❌ You're just starting with docling-graph
❌ You don't need custom logic
Custom Exporters¶
Use when:
✅ You need a specialized output format
✅ You're integrating with a specific database
✅ You need custom data transformations
✅ Default formats don't meet requirements
Don't use when:
❌ CSV, Cypher, or JSON formats work
❌ You can post-process existing exports
❌ You're prototyping
Custom Stages¶
Use when:
✅ You need custom preprocessing
✅ You want to add validation steps
✅ You need custom post-processing
✅ You're building a specialized pipeline
Don't use when:
❌ Default pipeline stages suffice
❌ You can achieve goals with configuration
❌ You're learning the system
Architecture¶
Extension Points¶
```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TB
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "Input Source" }
    B@{ shape: lin-proc, label: "Custom Stage 1" }
    C@{ shape: procs, label: "Docling Conversion" }
    D@{ shape: tag-proc, label: "Custom Backend" }
    E@{ shape: procs, label: "Extraction" }
    F@{ shape: lin-proc, label: "Custom Stage 2" }
    G@{ shape: procs, label: "Graph Conversion" }
    H@{ shape: tag-proc, label: "Custom Exporter" }
    I@{ shape: doc, label: "Output" }

    %% 3. Define Connections
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I

    %% 4. Apply Classes
    class A input
    class B,F config
    class C,E,G process
    class D,H operator
    class I output
```

Extension Points:

- Custom Backends (purple): Replace extraction logic
- Custom Exporters (purple): Replace export logic
- Custom Stages (yellow): Add processing steps
Code Organization¶
Project Structure for Extensions¶
```
my_project/
├── templates/           # Pydantic templates
│   └── my_template.py
├── backends/            # Custom backends
│   ├── __init__.py
│   └── my_backend.py
├── exporters/           # Custom exporters
│   ├── __init__.py
│   └── my_exporter.py
├── stages/              # Custom stages
│   ├── __init__.py
│   └── my_stage.py
├── tests/               # Tests
│   ├── test_backend.py
│   ├── test_exporter.py
│   └── test_stage.py
└── main.py              # Entry point
```
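To see how these pieces meet, here is a hypothetical `main.py` for the layout above. The import paths simply mirror the tree; how the custom backend and exporter are actually registered with the pipeline is covered in the Custom Backends and Custom Exporters guides, and the Integrate step below shows the same wiring in context.

```python
# main.py - hypothetical entry point for the layout above
from docling_graph import PipelineConfig

from backends.my_backend import MyBackend     # custom backend (see Custom Backends)
from exporters.my_exporter import MyExporter  # custom exporter (see Custom Exporters)

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    # Backend and exporter registration is covered in the guides above.
)
```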
Development Workflow¶
1. Design¶
```python
# Define the interface
from docling_graph.protocols import TextExtractionBackendProtocol

class MyBackend(TextExtractionBackendProtocol):
    """Custom backend implementation."""
    pass
```
2. Implement¶
```python
# Implement the protocol methods
def extract_from_markdown(self, markdown: str, template, context="", is_partial=False):
    """Extract structured data."""
    # Your logic here
    pass
```
3. Test¶
```python
# Write tests
def test_my_backend():
    backend = MyBackend()
    result = backend.extract_from_markdown("test", MyTemplate)
    assert result is not None
```
4. Integrate¶
```python
# Use in pipeline
from docling_graph import PipelineConfig
from backends.my_backend import MyBackend

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    # Custom backend integration
)
```
Best Practices¶
👍 Follow Protocols¶
```python
# ✅ Good - Implement the protocol
from docling_graph.protocols import TextExtractionBackendProtocol

class MyBackend(TextExtractionBackendProtocol):
    def extract_from_markdown(self, ...): ...
    def consolidate_from_pydantic_models(self, ...): ...
    def cleanup(self): ...

# ❌ Avoid - Custom interface the pipeline cannot call
class MyBackend:
    def my_custom_method(self, ...): ...
```
👍 Handle Errors¶
```python
# ✅ Good - Use docling-graph exceptions with context
from docling_graph.exceptions import ExtractionError

def extract(self, source, ...):
    try:
        result = self._process()
        return result
    except Exception as e:
        raise ExtractionError(
            "Extraction failed",
            details={"source": source},
            cause=e,
        )

# ❌ Avoid - Generic exceptions with no context
def extract(self, ...):
    raise Exception("Something went wrong")
```
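The Error Handling guide also covers retry mechanisms for transient failures such as rate limits or network timeouts. Here is a minimal sketch of the idea; the attempt count, backoff values, and the `backend` argument are illustrative, not part of the docling-graph API:

```python
import time

from docling_graph.exceptions import ExtractionError

def extract_with_retry(backend, markdown, template, retries=3, backoff=2.0):
    """Retry a failed extraction with exponential backoff (illustrative values)."""
    for attempt in range(1, retries + 1):
        try:
            return backend.extract_from_markdown(markdown, template)
        except ExtractionError:
            if attempt == retries:
                raise  # Out of attempts: propagate the last error
            time.sleep(backoff ** attempt)  # 2s, 4s, 8s, ...
```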
👍 Write Tests¶
```python
# ✅ Good - Cover success, failure, and cleanup paths
def test_backend_success():
    """Test successful extraction."""
    pass

def test_backend_failure():
    """Test error handling."""
    pass

def test_backend_cleanup():
    """Test resource cleanup."""
    pass

# ❌ Avoid - No tests
# (No tests written)
```
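The Testing guide covers mock backends, which keep tests like these fast and offline. A sketch of the idea follows; the `Invoice` template and the `consolidate_from_pydantic_models` signature are assumptions for illustration, and only the protocol method names come from `TextExtractionBackendProtocol`:

```python
from pydantic import BaseModel

class Invoice(BaseModel):
    """Hypothetical template used only for this test."""
    total: float

class MockBackend:
    """Implements the protocol methods but returns canned data offline."""

    def extract_from_markdown(self, markdown, template, context="", is_partial=False):
        return template(total=42.0)

    def consolidate_from_pydantic_models(self, models):  # signature assumed
        return models[0] if models else None

    def cleanup(self):
        pass

def test_extraction_returns_template_instance():
    backend = MockBackend()
    result = backend.extract_from_markdown("# Invoice\nTotal: 42.0", Invoice)
    assert isinstance(result, Invoice)
    assert result.total == 42.0
```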
👍 Document Code¶
```python
# ✅ Good - Clear documentation
class MyBackend:
    """Custom backend for specialized extraction.

    This backend uses a proprietary model to extract
    structured data from documents.

    Args:
        api_key: API key for the service.
        model: Model name to use.

    Example:
        >>> backend = MyBackend(api_key="key", model="model-v1")
        >>> result = backend.extract_from_markdown(text, Template)
    """
    pass

# ❌ Avoid - No documentation
class MyBackend:
    pass
```
Performance Considerations¶
Memory Management¶
```python
# ✅ Good - Release resources in cleanup()
class MyBackend:
    def cleanup(self):
        """Release resources."""
        if hasattr(self, 'model'):
            del self.model
        if hasattr(self, 'client'):
            self.client.close()

# ❌ Avoid - Memory leaks
class MyBackend:
    def cleanup(self):
        pass  # Resources not released
```
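A correct `cleanup()` only helps if it is guaranteed to run. One way to enforce that is a small context manager; this is plain standard library, not a docling-graph API:

```python
from contextlib import contextmanager

@contextmanager
def managed_backend(backend):
    """Yield the backend and guarantee cleanup() runs, even on errors."""
    try:
        yield backend
    finally:
        backend.cleanup()

# Usage:
# with managed_backend(MyBackend()) as backend:
#     backend.extract_from_markdown(markdown, Template)
```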
Batch Processing¶
```python
# ✅ Good - Process in fixed-size batches
def process_documents(docs):
    batch_size = 10
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        process_batch(batch)

# ❌ Avoid - Process all at once
def process_documents(docs):
    process_all(docs)  # May run out of memory
```
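The slicing idiom above assumes `docs` is a list. For lazy iterables (for example, documents streamed from disk) a generator-based batcher avoids materializing everything up front. This is a generic standard-library sketch, and `process_batch` is hypothetical:

```python
from itertools import islice

def batched(iterable, batch_size=10):
    """Yield successive lists of up to batch_size items from any iterable."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch

# for batch in batched(document_stream, batch_size=10):
#     process_batch(batch)
```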
Security Considerations¶
API Keys¶
```python
# ✅ Good - Use environment variables
import os

api_key = os.getenv("MY_API_KEY")
if not api_key:
    raise ValueError("MY_API_KEY not set")

# ❌ Avoid - Hardcoded keys
api_key = "sk-1234567890"  # Never do this!
```
Input Validation¶
```python
# ✅ Good - Validate inputs before processing
def extract(self, markdown: str, template):
    if not markdown:
        raise ValueError("Markdown cannot be empty")
    if not template:
        raise ValueError("Template is required")
    # Process...

# ❌ Avoid - No validation
def extract(self, markdown, template):
    # Process without checks
    pass
```
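Because templates are Pydantic models (see Schema Definition), the template check can be stricter than a truthiness test. A small sketch; docling-graph does not require this exact check:

```python
from pydantic import BaseModel

def validate_template(template):
    """Reject anything that is not a Pydantic model class."""
    if not (isinstance(template, type) and issubclass(template, BaseModel)):
        raise TypeError(f"template must be a Pydantic model class, got {template!r}")
```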
Next Steps¶
Choose a topic based on your needs:
- Custom Backends → Extend extraction capabilities
- Custom Exporters → Create custom output formats
- Custom Stages → Add pipeline stages
- Performance Tuning → Optimize performance
- Error Handling → Handle errors gracefully
- Testing → Test your extensions