
Advanced Topics

Overview

This section covers advanced topics for extending and optimizing docling-graph. These guides are for users who need to:

  • Create custom extraction backends
  • Build custom exporters
  • Add pipeline stages
  • Optimize performance
  • Handle errors gracefully
  • Test templates and pipelines

Topics

🧩 Extensibility

Custom Backends
Create custom extraction backends for specialized models or APIs.

  • Implement backend protocols
  • VLM backend example
  • LLM backend example
  • Integration with pipeline

Custom Exporters
Build custom exporters for specialized output formats.

  • Implement exporter protocol
  • Graph data access
  • Custom format generation
  • Registration and usage

Custom Stages
Add custom stages to the pipeline for specialized processing.

  • Pipeline stage protocol
  • Stage implementation
  • Context management
  • Error handling

📐 Optimization

Performance Tuning
Optimize extraction speed and resource usage.

  • Model selection strategies
  • Batch size optimization
  • Memory management
  • GPU utilization
  • Caching strategies

🛡️ Reliability

Error Handling
Handle errors gracefully and implement retry logic.

  • Exception hierarchy
  • Error recovery strategies
  • Logging and debugging
  • Retry mechanisms

Testing
Test templates, backends, and pipelines.

  • Template validation
  • Mock backends
  • Integration testing
  • CI/CD integration

Prerequisites

Before diving into advanced topics, ensure you understand:

  1. Schema Definition - Pydantic templates
  2. Pipeline Configuration - Configuration options
  3. Extraction Process - How extraction works
  4. Python API - Programmatic usage

When to Use Advanced Features

Custom Backends

Use when:
✅ You have a specialized model not supported by default
✅ You need to integrate with a proprietary API
✅ You want to implement custom preprocessing
✅ You need fine-grained control over extraction

Don't use when:
❌ Default backends meet your needs
❌ You're just starting with docling-graph
❌ You don't need custom logic

Custom Exporters

Use when:
✅ You need a specialized output format
✅ You're integrating with a specific database
✅ You need custom data transformations
✅ Default formats don't meet requirements

Don't use when:
❌ CSV, Cypher, or JSON formats work
❌ You can post-process existing exports
❌ You're prototyping

Custom Stages

Use when:
✅ You need custom preprocessing
✅ You want to add validation steps
✅ You need custom post-processing
✅ You're building a specialized pipeline

Don't use when:
❌ Default pipeline stages suffice
❌ You can achieve goals with configuration
❌ You're learning the system


Architecture

Extension Points

```mermaid
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TB
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "Input Source" }

    B@{ shape: lin-proc, label: "Custom Stage 1" }
    C@{ shape: procs, label: "Docling Conversion" }
    D@{ shape: tag-proc, label: "Custom Backend" }
    E@{ shape: procs, label: "Extraction" }
    F@{ shape: lin-proc, label: "Custom Stage 2" }
    G@{ shape: procs, label: "Graph Conversion" }
    H@{ shape: tag-proc, label: "Custom Exporter" }

    I@{ shape: doc, label: "Output" }

    %% 3. Define Connections
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I

    %% 4. Apply Classes
    class A input
    class B,F config
    class C,E,G process
    class D,H operator
    class I output
```

Extension Points:

  • Custom Backends (purple): Replace extraction logic
  • Custom Exporters (purple): Replace export logic
  • Custom Stages (yellow): Add processing steps
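As a rough illustration of how these extension points fit together, the sketch below models a pipeline as an ordered list of callables that pass along a shared context dictionary. Every name in it (`Context`, `Stage`, `run_pipeline`, and the stage functions) is hypothetical and is not part of the docling-graph API; the conversion, extraction, and export steps are trivial stand-ins, and graph conversion is omitted for brevity.

```python
from typing import Any, Callable

# Hypothetical types for illustration only; not docling-graph's API.
Context = dict[str, Any]
Stage = Callable[[Context], Context]


def normalize_source(ctx: Context) -> Context:
    """Custom stage 1: tidy the input before conversion."""
    ctx["source"] = ctx["source"].strip()
    return ctx


def convert(ctx: Context) -> Context:
    """Stand-in for document conversion: treat the source as Markdown."""
    ctx["markdown"] = ctx["source"]
    return ctx


def extract(ctx: Context) -> Context:
    """Stand-in for a custom backend: one record per non-empty line."""
    ctx["records"] = [line for line in ctx["markdown"].splitlines() if line]
    return ctx


def validate(ctx: Context) -> Context:
    """Custom stage 2: fail fast if extraction produced nothing."""
    if not ctx["records"]:
        raise ValueError("extraction produced no records")
    return ctx


def export(ctx: Context) -> Context:
    """Stand-in for a custom exporter: join records into one string."""
    ctx["output"] = ";".join(ctx["records"])
    return ctx


def run_pipeline(source: str, stages: list[Stage]) -> Context:
    """Run each stage in order, threading the context through."""
    ctx: Context = {"source": source}
    for stage in stages:
        ctx = stage(ctx)
    return ctx


result = run_pipeline(
    "  alpha\nbeta  ",
    [normalize_source, convert, extract, validate, export],
)
print(result["output"])  # alpha;beta
```

Swapping `extract` or `export` for a different callable corresponds to replacing a backend or exporter, while inserting `normalize_source` or `validate` corresponds to adding a stage.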


Code Organization

Project Structure for Extensions

my_project/
├── templates/           # Pydantic templates
│   └── my_template.py
├── backends/            # Custom backends
│   ├── __init__.py
│   └── my_backend.py
├── exporters/           # Custom exporters
│   ├── __init__.py
│   └── my_exporter.py
├── stages/              # Custom stages
│   ├── __init__.py
│   └── my_stage.py
├── tests/               # Tests
│   ├── test_backend.py
│   ├── test_exporter.py
│   └── test_stage.py
└── main.py              # Entry point


Development Workflow

1. Design

```python
# Define interface
from docling_graph.protocols import TextExtractionBackendProtocol

class MyBackend(TextExtractionBackendProtocol):
    """Custom backend implementation."""
    pass
```

2. Implement

# Implement methods
def extract_from_markdown(self, markdown: str, template, context="", is_partial=False):
    """Extract structured data."""
    # Your logic here
    pass
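To make the method body more concrete, here is a minimal sketch of one possible implementation. It assumes nothing about the real protocol beyond the signature shown above. The `Invoice` dataclass stands in for a Pydantic template so the example has no third-party dependencies, and `KeyValueBackend`, its `key: value` parsing rule, and its reading of `is_partial` as "missing fields are allowed" are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class Invoice:
    """Stand-in template; a real template would be a Pydantic model."""
    number: str
    total: str


class KeyValueBackend:
    """Hypothetical backend that reads `key: value` lines from Markdown."""

    def extract_from_markdown(self, markdown: str, template, context="", is_partial=False):
        wanted = {f.name for f in fields(template)}
        found = {}
        for line in markdown.splitlines():
            key, sep, value = line.partition(":")
            key = key.strip().lower()
            if sep and key in wanted:
                found[key] = value.strip()
        missing = wanted - set(found)
        if missing and not is_partial:
            # Assumption: signal "no result" rather than return a half-filled object
            return None
        # For partial input, fill the gaps with empty strings
        return template(**{name: found.get(name, "") for name in wanted})

    def cleanup(self):
        """Nothing to release in this toy backend."""


backend = KeyValueBackend()
invoice = backend.extract_from_markdown("Number: INV-7\nTotal: 120.50", Invoice)
print(invoice)
```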

3. Test

# Write tests
def test_my_backend():
    backend = MyBackend()
    result = backend.extract_from_markdown("test", MyTemplate)
    assert result is not None

4. Integrate

# Use in pipeline
from docling_graph import PipelineConfig
from backends.my_backend import MyBackend  # matches the project structure above

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    # Custom backend integration
)

Best Practices

👍 Follow Protocols

# ✅ Good - Implement protocol
from docling_graph.protocols import TextExtractionBackendProtocol

class MyBackend(TextExtractionBackendProtocol):
    def extract_from_markdown(self, ...): ...
    def consolidate_from_pydantic_models(self, ...): ...
    def cleanup(self): ...

# ❌ Avoid - Custom interface
class MyBackend:
    def my_custom_method(self, ...): ...
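The reason for following a protocol can be shown with Python's `typing.Protocol`. The sketch below is self-contained: `BackendProtocol` is a hypothetical stand-in for the library's protocol, and whether the real `TextExtractionBackendProtocol` is runtime-checkable is an assumption not confirmed here.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class BackendProtocol(Protocol):
    """Hypothetical stand-in for the library's backend protocol."""

    def extract_from_markdown(
        self, markdown: str, template: Any, context: str = "", is_partial: bool = False
    ) -> Any: ...

    def cleanup(self) -> None: ...


class GoodBackend:
    """Provides every method the protocol names."""

    def extract_from_markdown(self, markdown, template, context="", is_partial=False):
        return {"text": markdown}

    def cleanup(self):
        pass


class BadBackend:
    """Exposes its own interface, so protocol-based callers cannot use it."""

    def my_custom_method(self, text):
        return text


# isinstance() on a runtime-checkable protocol only checks that the
# methods exist; it does not compare their signatures.
print(isinstance(GoodBackend(), BackendProtocol))  # True
print(isinstance(BadBackend(), BackendProtocol))   # False
```

Because the check is structural, `GoodBackend` does not need to inherit from the protocol class; a static type checker such as mypy would additionally verify the signatures.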

👍 Handle Errors

# ✅ Good - Use docling-graph exceptions
from docling_graph.exceptions import ExtractionError

def extract(self, source):
    try:
        result = self._process(source)
        return result
    except Exception as e:
        # Wrap the original error so callers can catch a single exception type
        raise ExtractionError(
            "Extraction failed",
            details={"source": source},
            cause=e
        ) from e

# ❌ Avoid - Generic exceptions
def extract(self, source):
    raise Exception("Something went wrong")
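One benefit of raising a specific exception type is that callers can retry on it selectively. The following is a minimal sketch of retry with exponential backoff. The `ExtractionError` class here is a local stand-in so the example runs without docling-graph installed, and `with_retries` and `flaky_extract` are hypothetical helpers, not library functions.

```python
import time


class ExtractionError(Exception):
    """Local stand-in so the sketch runs without docling-graph installed."""


def with_retries(func, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call func(); on ExtractionError wait and retry, doubling the delay."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except ExtractionError:
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_extract():
    """Fails twice, then succeeds, to exercise the retry loop."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ExtractionError("temporary failure")
    return "ok"


print(with_retries(flaky_extract))  # ok
```

Only `ExtractionError` is retried; any other exception propagates immediately, which keeps programming errors from being masked by the retry loop.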

👍 Write Tests

# ✅ Good - Comprehensive tests
def test_backend_success():
    """Test successful extraction."""
    pass

def test_backend_failure():
    """Test error handling."""
    pass

def test_backend_cleanup():
    """Test resource cleanup."""
    pass

# ❌ Avoid - No tests
# (No tests written)
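The three placeholder tests above could be filled in along these lines. This is a sketch using a hand-written test double: `FakeBackend` is hypothetical, and the built-in `str` is used as a trivial "template" so the example stays dependency-free.

```python
class FakeBackend:
    """Hypothetical test double that records whether cleanup() ran."""

    def __init__(self, fail=False):
        self.fail = fail
        self.cleaned_up = False

    def extract_from_markdown(self, markdown, template, context="", is_partial=False):
        if self.fail:
            raise RuntimeError("model unavailable")
        return template(markdown)

    def cleanup(self):
        self.cleaned_up = True


def test_backend_success():
    """Extraction returns an instance of the template."""
    result = FakeBackend().extract_from_markdown("hello", str)
    assert result == "hello"


def test_backend_failure():
    """A failing backend raises instead of returning bad data."""
    backend = FakeBackend(fail=True)
    try:
        backend.extract_from_markdown("hello", str)
    except RuntimeError as exc:
        assert "unavailable" in str(exc)
    else:
        raise AssertionError("expected RuntimeError")


def test_backend_cleanup():
    """cleanup() releases resources and is safe to call twice."""
    backend = FakeBackend()
    backend.cleanup()
    backend.cleanup()
    assert backend.cleaned_up


# pytest would discover these automatically; calling them directly also works.
for test in (test_backend_success, test_backend_failure, test_backend_cleanup):
    test()
```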

👍 Document Code

# ✅ Good - Clear documentation
class MyBackend:
    """
    Custom backend for specialized extraction.

    This backend uses a proprietary model to extract
    structured data from documents.

    Args:
        api_key: API key for the service
        model: Model name to use

    Example:
        >>> backend = MyBackend(api_key="key", model="model-v1")
        >>> result = backend.extract_from_markdown(text, Template)
    """
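    def __init__(self, api_key: str, model: str):
        # Store the documented arguments so the docstring example works
        self.api_key = api_key
        self.model = model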

# ❌ Avoid - No documentation
class MyBackend:
    pass

Performance Considerations

Memory Management

# ✅ Good - Clean up resources
class MyBackend:
    def cleanup(self):
        """Release resources."""
        if hasattr(self, 'model'):
            del self.model
        if hasattr(self, 'client'):
            self.client.close()

# ❌ Avoid - Memory leaks
class MyBackend:
    def cleanup(self):
        pass  # Resources not released
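A correct `cleanup()` only helps if it is actually called, including when extraction fails partway. One way to guarantee that is a small context manager, sketched below. `TrackingBackend` and `managed` are hypothetical names; the only assumption about a backend is that it has a `cleanup()` method.

```python
from contextlib import contextmanager


class TrackingBackend:
    """Hypothetical backend that holds a resource until cleanup() is called."""

    def __init__(self):
        self.model = object()  # Stand-in for a loaded model

    def cleanup(self):
        if hasattr(self, "model"):
            del self.model


@contextmanager
def managed(backend):
    """Yield the backend and always call cleanup(), even if the body raises."""
    try:
        yield backend
    finally:
        backend.cleanup()


backend = TrackingBackend()
try:
    with managed(backend) as b:
        raise RuntimeError("extraction failed midway")
except RuntimeError:
    pass

print(hasattr(backend, "model"))  # False: the resource was released
```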

Batch Processing

# ✅ Good - Process in batches
def process_documents(docs):
    batch_size = 10
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i+batch_size]
        process_batch(batch)

# ❌ Avoid - Process all at once
def process_documents(docs):
    process_all(docs)  # May run out of memory
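The slicing approach above requires `docs` to be a list that is already in memory. When documents come from a generator or a directory scan, a lazy batching helper avoids materializing the whole input. The sketch below uses only the standard library; `batched` is a local helper defined here, not a docling-graph function.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield lists of up to batch_size items without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


paths = (f"doc_{i}.pdf" for i in range(5))  # A generator, not a list
for batch in batched(paths, 2):
    print(batch)
# ['doc_0.pdf', 'doc_1.pdf']
# ['doc_2.pdf', 'doc_3.pdf']
# ['doc_4.pdf']
```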

Security Considerations

API Keys

# ✅ Good - Use environment variables
import os

api_key = os.getenv("MY_API_KEY")
if not api_key:
    raise ValueError("MY_API_KEY not set")

# ❌ Avoid - Hardcoded keys
api_key = "sk-1234567890"  # Never do this!

Input Validation

# ✅ Good - Validate inputs
def extract(self, markdown: str, template):
    if not markdown:
        raise ValueError("Markdown cannot be empty")
    if not template:
        raise ValueError("Template is required")
    # Process...

# ❌ Avoid - No validation
def extract(self, markdown, template):
    # Process without checks
    pass
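Truthiness checks alone let some bad inputs through, such as whitespace-only Markdown or a template instance passed where a class is expected. A slightly stricter check might look like the sketch below. `validate_inputs` is a hypothetical helper, and the rule that a template must be a class is an assumption based on the examples in this guide.

```python
import inspect


def validate_inputs(markdown, template):
    """Raise early with a specific message instead of failing deep in extraction."""
    if not isinstance(markdown, str):
        raise TypeError(f"markdown must be str, got {type(markdown).__name__}")
    if not markdown.strip():
        raise ValueError("Markdown cannot be empty or whitespace-only")
    # Assumption: templates are passed as classes, not instances
    if not inspect.isclass(template):
        raise TypeError("template must be a class, not an instance")


class ExampleTemplate:
    """Placeholder template class for the demonstration."""


validate_inputs("# Title", ExampleTemplate)  # Passes silently

for bad_args in [("   ", ExampleTemplate), (b"bytes", ExampleTemplate), ("text", ExampleTemplate())]:
    try:
        validate_inputs(*bad_args)
    except (TypeError, ValueError) as exc:
        print(f"rejected: {exc}")
```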

Next Steps

Choose a topic based on your needs:

  1. Custom Backends - Extend extraction capabilities
  2. Custom Exporters - Create custom output formats
  3. Custom Stages - Add pipeline stages
  4. Performance Tuning - Optimize performance
  5. Error Handling - Handle errors gracefully
  6. Testing - Test your extensions