Architecture

System Architecture

Docling Graph follows a modular, pipeline-based architecture with clear separation of concerns:

Core Components

Document Processor

Converts documents to structured format using Docling with OCR or Vision pipelines.

Location: docling_graph/core/extractors/document_processor.py
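A minimal usage sketch (the class name DocumentProcessor is an assumption based on the module path; both methods appear again in Stage 2 below):

# Sketch only: the class name and constructor options are assumptions.
from docling_graph.core.extractors.document_processor import DocumentProcessor

processor = DocumentProcessor()
doc = processor.convert_to_docling_doc("invoice.pdf")  # OCR or Vision pipeline under the hood
markdown = processor.extract_full_markdown(doc)        # Markdown for the LLM backend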

Extraction Backends

VLM Backend: Direct extraction from images using vision-language models (local only)
LLM Backend: Text-based extraction supporting local (vLLM, Ollama) and remote APIs

Location: docling_graph/core/extractors/backends/

Processing Strategies

One-to-One: Each page produces a separate model (invoice batches, ID cards)
Many-to-One: Multiple pages merged into a single model (rheology research papers, reports)

Location: docling_graph/core/extractors/strategies/
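The strategy is selected through configuration; a sketch using PipelineConfig (defined later on this page), with a hypothetical template path:

from docling_graph.config import PipelineConfig

config = PipelineConfig(
    source="invoices.pdf",
    template="my_templates.InvoiceTemplate",  # hypothetical dotted path
    processing_mode="one-to-one",             # one model per page
)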

Document Chunker

Splits large documents while preserving semantic coherence and respecting structure.

Location: docling_graph/core/extractors/document_chunker.py
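A sketch of how the chunker slots into the LLM path (class name and parameters are assumptions based on the module path):

# Sketch only: DocumentChunker and max_tokens are assumed names.
from docling_graph.core.extractors.document_chunker import DocumentChunker

chunker = DocumentChunker(max_tokens=4096)  # assumed token-budget parameter
chunks = chunker.chunk(doc)                 # structure-aware, semantically coherent splits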

Graph Converter

Transforms Pydantic models to NetworkX graphs with stable node IDs and automatic deduplication.

Location: docling_graph/core/converters/graph_converter.py
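Stable node IDs are what make deduplication work: the same entity always hashes to the same ID, so repeated mentions across chunks collapse into a single node. Conceptually (the actual implementation may differ):

import hashlib

def stable_node_id(label: str, key_fields: dict) -> str:
    """Derive a deterministic ID from an entity's label and identifying fields."""
    payload = label + "|" + "|".join(f"{k}={key_fields[k]}" for k in sorted(key_fields))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

# The same entity extracted from two different chunks yields the same node:
stable_node_id("Person", {"name": "Ada Lovelace"})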

Exporters & Visualizers

Export graphs in CSV, Cypher, and JSON formats, and generate interactive HTML visualizations.

Location: docling_graph/core/exporters/, docling_graph/core/visualizers/

Complete Pipeline Flow

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TB
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "Input Source" }
    A1@{ shape: procs, label: "1. Input Normalization<br/>Type Detection & Validation" }

    A2{"Input Type"}

    %% Ingestion Paths
    B@{ shape: procs, label: "2a. Docling Conversion<br/>Generates Images & Markdown" }
    B2@{ shape: lin-proc, label: "2b. Text Processing<br/>Direct to Markdown" }
    B3@{ shape: lin-proc, label: "2c. Load DoclingDocument<br/>Pre-parsed Content" }

    %% Strategy Decision
    C{"3. Backend"}

    %% Extraction Paths
    D@{ shape: lin-proc, label: "4a. VLM Extraction<br/>Page-by-Page (Images)" }
    E@{ shape: lin-proc, label: "4b. Markdown Prep<br/>Merge Text Content" }

    %% Chunking Logic (LLM Path)
    F{"5. Chunking"}
    G@{ shape: tag-proc, label: "6a. Hybrid Chunking<br/>Semantic + Token-Aware" }
    H@{ shape: tag-proc, label: "6b. Full Document<br/>Context Window Permitting" }

    I@{ shape: procs, label: "7. Batch Extraction<br/>LLM Inference" }

    %% Convergence & Validation
    J@{ shape: tag-proc, label: "8. Pydantic Validation<br/>Per-Chunk/Page Check" }

    K{"9. Consolidation"}

    L@{ shape: lin-proc, label: "10a. Smart Merge<br/>Programmatic/Reduce" }
    M@{ shape: lin-proc, label: "10b. LLM Consolidation<br/>Refinement Loop" }

    %% Graph & Export
    N@{ shape: procs, label: "11. Graph Conversion<br/>Pydantic → NetworkX" }
    O@{ shape: tag-proc, label: "12. Node ID Generation<br/>Stable Hashing" }

    P@{ shape: tag-proc, label: "13. Export<br/>CSV/Cypher/JSON" }
    Q@{ shape: tag-proc, label: "14. Visualization<br/>HTML + Reports" }

    %% 3. Define Connections
    A --> A1
    A1 --> A2

    %% Routing Inputs
    A2 -- "PDF/Image" --> B
    A2 -- "Text/MD" --> B2
    A2 -- "DoclingDoc" --> B3

    %% Routing to Backend Strategy
    B --> C
    B2 & B3 --> E

    %% Backend Decisions
    C -- VLM --> D
    C -- LLM --> E

    %% LLM Path: Markdown -> Chunking -> Extraction
    E --> F
    F -- Yes --> G
    F -- No --> H

    G --> I
    H --> I

    %% VLM Path: Direct to Validation (Skips Chunking)
    D --> J

    %% LLM Path: Join Validation
    I --> J

    %% Consolidation
    J --> K
    K -- "Rule-Based" --> L
    K -- "AI-Based" --> M

    %% Final Stages
    L --> N
    M --> N

    N --> O
    O --> P
    P --> Q

    %% 4. Apply Classes
    class A input
    class A1,B,I,N process
    class B2,B3,D,E,L,M process
    class A2,C,F,K decision
    class G,H,J,O operator
    class P,Q output

Stage-by-Stage Breakdown

Stage 1: Template Loading

# Load Pydantic template
template = import_template("module.Template")
# Validate structure
validate_template(template)

Stage 2: Document Conversion

# Convert using Docling
doc = processor.convert_to_docling_doc(source)
# Extract markdown
markdown = processor.extract_full_markdown(doc)

Stage 3: Extraction

# Choose backend
if backend == "vlm":
    models = vlm_backend.extract_from_document(source, template)
else:
    models = llm_backend.extract_from_markdown(markdown, template)

Stage 4: Consolidation (if needed)

if len(models) > 1:
    if llm_consolidation:
        final_model = llm_backend.consolidate(models, template)
    else:
        final_model = programmatic_merge(models)

Stage 5: Graph Conversion

# Convert to graph
graph, metadata = converter.pydantic_list_to_graph([final_model])

Stage 6: Export

# Export in multiple formats
csv_exporter.export(graph, output_dir)
cypher_exporter.export(graph, output_dir)
json_exporter.export(graph, output_dir)

Protocol-Based Design

Docling Graph uses Python Protocols for type-safe, flexible interfaces:

class ExtractionBackendProtocol(Protocol):
    """Protocol for extraction backends"""
    def extract_from_document(self, source: str, template: Type[BaseModel]) -> List[BaseModel]: ...

Benefits: Type safety, easy mocking, clear contracts, flexible implementations
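Because Protocols use structural typing, any class with a matching method signature satisfies the contract; no inheritance is required, which is what makes mocking trivial. A minimal test double:

from typing import List, Type
from pydantic import BaseModel

class FakeBackend:
    """Satisfies ExtractionBackendProtocol structurally, without subclassing it."""
    def extract_from_document(self, source: str, template: Type[BaseModel]) -> List[BaseModel]:
        return []  # a real backend would return populated template instances

# Type-checks against the Protocol defined above under mypy/pyright:
backend: ExtractionBackendProtocol = FakeBackend()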

Configuration

Location: docling_graph/config.py

Purpose: Type-safe configuration using Pydantic

class PipelineConfig(BaseModel):
    """Single source of truth for all defaults"""
    source: str
    template: Union[str, Type[BaseModel]]
    backend: Literal["llm", "vlm"] = "llm"
    inference: Literal["local", "remote"] = "local"
    processing_mode: Literal["one-to-one", "many-to-one"] = "many-to-one"
    use_chunking: bool = True
    llm_consolidation: bool = False
    export_format: Literal["csv", "cypher"] = "csv"
    output_dir: str = "outputs"
    # ... additional settings
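A typical invocation; Pydantic validates every field at construction time (the template path here is hypothetical):

from docling_graph.config import PipelineConfig

config = PipelineConfig(
    source="report.pdf",
    template="my_templates.ReportTemplate",  # hypothetical dotted path
    backend="llm",
    use_chunking=True,
)
# Out-of-range values fail fast, e.g. backend="gpt" raises a Pydantic ValidationError.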

Error Handling

Location: docling_graph/exceptions.py

Hierarchy:

DoclingGraphError (base)
├── ConfigurationError
├── ClientError
├── ExtractionError
├── ValidationError
├── GraphError
└── PipelineError

Structured Errors:

try:
    run_pipeline(config)
except ClientError as e:
    print(f"Error: {e.message}")
    print(f"Details: {e.details}")
    print(f"Cause: {e.cause}")

Extensibility

Docling Graph is designed for extension:

  • LLM Providers: Implement LLMClientProtocol
  • Pipeline Stages: Implement PipelineStage
  • Export Formats: Extend BaseExporter

See Custom Backends for details.
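As one example, a GraphML exporter might extend BaseExporter like this (the import path and method signature are assumptions inferred from the export(graph, output_dir) calls in Stage 6):

import networkx as nx
from docling_graph.core.exporters import BaseExporter  # import path is an assumption

class GraphMLExporter(BaseExporter):
    def export(self, graph: nx.Graph, output_dir: str) -> None:
        nx.write_graphml(graph, f"{output_dir}/graph.graphml")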

Next Steps

Now that you understand the architecture:

  1. Installation - Set up your environment
  2. Schema Definition - Create Pydantic templates
  3. Pipeline Configuration - Configure the pipeline