Knowledge Graph Management

Overview

Knowledge Graph Management covers converting extracted Pydantic models into graph structures, exporting to various formats, and visualizing the results.

What you'll learn:

- Graph conversion from Pydantic models
- Export formats (CSV, Cypher, JSON)
- Visualization techniques
- Graph analysis and statistics
- Neo4j integration


The Graph Pipeline

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "Pydantic Models" }

    B@{ shape: procs, label: "Graph Conversion" }
    C@{ shape: doc, label: "NetworkX Graph" }

    D@{ shape: tag-proc, label: "Export" }
    F@{ shape: tag-proc, label: "Visualization" }

    E1@{ shape: doc, label: "CSV Files" }
    E2@{ shape: doc, label: "Cypher Script" }
    E3@{ shape: doc, label: "JSON" }

    G@{ shape: doc, label: "Interactive HTML" }

    %% 3. Define Connections
    A --> B
    B --> C

    C --> D
    C --> F

    D --> E1
    D --> E2
    D --> E3

    F --> G

    %% 4. Apply Classes
    class A input
    class B process
    class C data
    class D,F operator
    class E1,E2,E3,G output

Key Concepts

1. Graph Conversion

Transform Pydantic models into a graph structure:

from docling_graph.core.converters import GraphConverter

converter = GraphConverter()
graph, metadata = converter.pydantic_list_to_graph(models)

print(f"Nodes: {metadata.node_count}")
print(f"Edges: {metadata.edge_count}")

Learn more: Graph Conversion →


2. Export Formats

Export graphs in multiple formats:

from docling_graph.core.exporters import CSVExporter, CypherExporter

# CSV export
CSVExporter().export(graph, output_dir)

# Cypher export
CypherExporter().export(graph, output_file)

Learn more: Export Formats →


3. Visualization

Generate interactive visualizations:

from docling_graph.core.visualizers import InteractiveVisualizer

visualizer = InteractiveVisualizer()
visualizer.save_cytoscape_graph(graph, "graph.html")

Learn more: Visualization →


4. Neo4j Integration

Import graphs into Neo4j:

# Import Cypher script
cat graph.cypher | cypher-shell -u neo4j -p password

Learn more: Neo4j Integration →


Quick Start

Complete Pipeline

from docling_graph import run_pipeline, PipelineConfig

# Run complete pipeline
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local",
    export_format="csv",  # or "cypher"
    output_dir="outputs"
)

run_pipeline(config)

# Outputs:
# - outputs/nodes.csv
# - outputs/edges.csv
# - outputs/graph_stats.json
# - outputs/visualization.html

Graph Structure

Nodes

Nodes represent entities from your Pydantic models:

# Node structure
{
    "id": "invoice_001",
    "label": "BillingDocument",
    "type": "entity",
    "document_no": "INV-001",
    "total": 1000
}

Edges

Edges represent relationships between entities:

# Edge structure
{
    "source": "invoice_001",
    "target": "org_acme",
    "label": "ISSUED_BY"
}
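
To make the node and edge shapes above concrete, here is a minimal sketch of how a nested model could be flattened into them. It is not docling-graph's implementation: standard-library dataclasses stand in for Pydantic models, and the `to_graph` helper, the ID scheme, and the rule that an edge label is the upper-cased field name are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Organization:
    name: str


@dataclass
class BillingDocument:
    document_no: str
    total: int
    issued_by: Organization


def to_graph(root):
    """Flatten nested dataclass instances into node and edge dicts."""
    nodes, edges = [], []

    def visit(obj):
        label = type(obj).__name__
        # Hypothetical ID scheme: class name plus the first field's value.
        first = fields(obj)[0].name
        node_id = f"{label}_{getattr(obj, first)}".lower()
        node = {"id": node_id, "label": label, "type": "entity"}
        for f in fields(obj):
            value = getattr(obj, f.name)
            if is_dataclass(value):
                # Nested model: becomes its own node plus a connecting edge.
                child_id = visit(value)
                edges.append(
                    {"source": node_id, "target": child_id, "label": f.name.upper()}
                )
            else:
                # Scalar field: stored as a node property.
                node[f.name] = value
        nodes.append(node)
        return node_id

    visit(root)
    return nodes, edges


invoice = BillingDocument("INV-001", 1000, Organization("acme"))
nodes, edges = to_graph(invoice)
print(edges)
```

Scalar fields end up as node properties and nested models become separate nodes, which is why a model with no nested entities yields a graph without edges.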

Export Formats Comparison

| Format | Best For | File Type | Use Case |
|--------|----------|-----------|----------|
| CSV | Analysis, spreadsheets | .csv | Data analysis, Excel |
| Cypher | Neo4j import | .cypher | Graph database |
| JSON | APIs, processing | .json | Programmatic access |
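
As a rough illustration of how the tabular and JSON formats differ, the sketch below serializes the same node and edge dicts both ways using only the standard library. The column layout, the `to_csv` helper, and the top-level `nodes`/`edges` JSON keys are assumptions for this example, not the output of `CSVExporter` or the pipeline.

```python
import csv
import io
import json

nodes = [
    {"id": "invoice_001", "label": "BillingDocument", "document_no": "INV-001", "total": 1000},
    {"id": "org_acme", "label": "Organization", "name": "Acme"},
]
edges = [{"source": "invoice_001", "target": "org_acme", "label": "ISSUED_BY"}]


def to_csv(rows):
    """Serialize dicts to CSV text; the header is the union of all keys."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    # restval fills properties that a given node type does not have.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


nodes_csv = to_csv(nodes)
edges_csv = to_csv(edges)
graph_json = json.dumps({"nodes": nodes, "edges": edges}, indent=2)

print(nodes_csv)
print(edges_csv)
```

CSV forces every node type into one shared set of columns, so heterogeneous properties leave empty cells, while JSON keeps each node's own properties.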

Section Contents

1. Graph Conversion

Learn how Pydantic models are converted to NetworkX graphs.

Topics:

- Node creation
- Edge generation
- Node ID registry
- Graph validation
- Automatic cleanup


2. Export Formats

Understand different export formats and when to use them.

Topics:

- CSV export (nodes and edges)
- Cypher export (Neo4j)
- JSON export (programmatic)
- Format selection


3. Visualization

Generate interactive visualizations of your knowledge graphs.

Topics:

- Interactive HTML graphs
- Markdown reports
- Graph statistics
- Customization options


4. Neo4j Integration

Import and query graphs in a Neo4j database.

Topics:

- Cypher import
- Neo4j setup
- Query examples
- Best practices


5. Graph Analysis

Analyze graph structure and statistics.

Topics:

- Node and edge counts
- Graph metrics
- Connectivity analysis
- Quality checks


6. Advanced Topics

Advanced graph management techniques covered in other sections.

See also:

- Custom Exporters
- Performance Tuning
- Graph Analysis


Common Workflows

Workflow 1: CSV Analysis

from docling_graph import run_pipeline, PipelineConfig

# Extract and export to CSV
config = PipelineConfig(
    source="invoices.pdf",
    template="templates.BillingDocument",
    export_format="csv",
    output_dir="analysis"
)

run_pipeline(config)

# Analyze in Python
import pandas as pd

nodes = pd.read_csv("analysis/nodes.csv")
edges = pd.read_csv("analysis/edges.csv")

print(f"Total invoices: {len(nodes[nodes['label'] == 'BillingDocument'])}")
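
If pandas is not available, the same kind of count can be done with the standard library. This sketch assumes only that the nodes file has a `label` column, as the pandas example above does; the inline sample stands in for `analysis/nodes.csv`, and its rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Stand-in for analysis/nodes.csv; real files may carry more columns.
sample = io.StringIO(
    "id,label\n"
    "invoice_001,BillingDocument\n"
    "org_acme,Organization\n"
    "org_globex,Organization\n"
)

# Count how many nodes carry each label.
label_counts = Counter(row["label"] for row in csv.DictReader(sample))

print(f"Total invoices: {label_counts['BillingDocument']}")
print(label_counts.most_common(1))
```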

Workflow 2: Neo4j Import

from docling_graph import run_pipeline, PipelineConfig

# Extract and export to Cypher
config = PipelineConfig(
    source="contracts.pdf",
    template="templates.Contract",
    export_format="cypher",
    output_dir="neo4j_import"
)

run_pipeline(config)

# Import to Neo4j
# cat neo4j_import/graph.cypher | cypher-shell
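
For a sense of what a Cypher import script can contain, the sketch below turns node and edge dicts into `MERGE` statements. This is one plausible shape, not the script `CypherExporter` actually writes: the helper names are hypothetical, JSON encoding is used as a stand-in for Cypher literal quoting, and labels and relationship types are assumed to be plain identifiers that need no escaping.

```python
import json


def cypher_literal(value):
    # Assumption: JSON encoding matches Cypher literal syntax for the simple
    # values used here (double-quoted strings and numbers).
    return json.dumps(value)


def node_statement(node):
    """Build a MERGE statement keyed on the node id, with other fields as properties."""
    props = {k: v for k, v in node.items() if k not in ("id", "label")}
    stmt = f"MERGE (n:{node['label']} {{id: {cypher_literal(node['id'])}}})"
    if props:
        assignments = ", ".join(
            f"n.{key} = {cypher_literal(value)}" for key, value in props.items()
        )
        stmt += f" SET {assignments}"
    return stmt + ";"


def edge_statement(edge):
    """Match both endpoints by id, then MERGE the relationship between them."""
    return (
        f"MATCH (a {{id: {cypher_literal(edge['source'])}}}), "
        f"(b {{id: {cypher_literal(edge['target'])}}}) "
        f"MERGE (a)-[:{edge['label']}]->(b);"
    )


print(node_statement({"id": "invoice_001", "label": "BillingDocument", "total": 1000}))
print(edge_statement({"source": "invoice_001", "target": "org_acme", "label": "ISSUED_BY"}))
```

Using `MERGE` rather than `CREATE` makes a script safe to re-run, since existing nodes and relationships are matched instead of duplicated.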

Workflow 3: Programmatic Access

from docling_graph import run_pipeline, PipelineConfig
import json

# Extract and access programmatically
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    output_dir="data"
)

run_pipeline(config)

# Load graph data
with open("data/graph_data.json") as f:
    graph_data = json.load(f)

# Process nodes
for node in graph_data["nodes"]:
    print(f"{node['type']}: {node['id']}")
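
Once the data is loaded, relationships can be followed by indexing edges by their source node. The sketch below uses an inline dict in place of the loaded file; the `edges` key and the `source`/`target`/`label` fields are assumptions that mirror the edge structure shown earlier, so check them against your actual `graph_data.json`.

```python
from collections import defaultdict

graph_data = {
    "nodes": [
        {"id": "invoice_001", "type": "entity"},
        {"id": "org_acme", "type": "entity"},
    ],
    # Assumed key and edge shape, mirroring the edge structure shown earlier.
    "edges": [{"source": "invoice_001", "target": "org_acme", "label": "ISSUED_BY"}],
}

# Index outgoing edges by source node ID for quick lookups.
outgoing = defaultdict(list)
for edge in graph_data["edges"]:
    outgoing[edge["source"]].append((edge["label"], edge["target"]))

# Print every relationship, grouped by the node it starts from.
for node in graph_data["nodes"]:
    for label, target in outgoing.get(node["id"], []):
        print(f"{node['id']} -[{label}]-> {target}")
```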

Graph Statistics

Automatic Statistics

Every pipeline run writes statistics to graph_stats.json in the output directory:

{
  "node_count": 15,
  "edge_count": 18,
  "node_types": {
    "BillingDocument": 1,
    "Organization": 2,
    "Address": 3,
    "LineItem": 9
  },
  "edge_types": {
    "ISSUED_BY": 1,
    "SENT_TO": 1,
    "LOCATED_AT": 5,
    "CONTAINS_LINE": 9
  },
  "avg_degree": 2.4,
  "density": 0.17
}

Using Statistics

import json

# Load statistics
with open("outputs/graph_stats.json") as f:
    stats = json.load(f)

print(f"Graph has {stats['node_count']} nodes")
print(f"Most common node type: {max(stats['node_types'], key=stats['node_types'].get)}")
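
The derived metrics can be cross-checked against the raw counts. The sketch below assumes `avg_degree` is 2E/N and `density` is 2E/(N(N-1)), the usual undirected definitions. These reproduce the 2.4 and 0.17 in the sample above, but whether the library computes them this way is an assumption. Note that the sample's per-type edge counts add up to 16 rather than 18, which the last check flags.

```python
stats = {
    "node_count": 15,
    "edge_count": 18,
    "node_types": {"BillingDocument": 1, "Organization": 2, "Address": 3, "LineItem": 9},
    "edge_types": {"ISSUED_BY": 1, "SENT_TO": 1, "LOCATED_AT": 5, "CONTAINS_LINE": 9},
}

n, e = stats["node_count"], stats["edge_count"]

# Assumed definitions (undirected conventions); they reproduce the sample values.
avg_degree = 2 * e / n
density = 2 * e / (n * (n - 1))

# Per-type counts should add up to the totals.
typed_nodes = sum(stats["node_types"].values())
typed_edges = sum(stats["edge_types"].values())

print(round(avg_degree, 2), round(density, 2))
print(f"node types match total: {typed_nodes == n}")
print(f"edge types match total: {typed_edges == e}")
```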

Visualization Preview

Interactive HTML

Every pipeline run generates an interactive visualization:

outputs/
└── visualization.html  # Open in browser

Features:

- Zoom and pan
- Node inspection
- Search functionality
- Export to image


Best Practices

👍 Choose the Right Format

# ✅ Good - Match format to use case
use_case = "neo4j"  # e.g. "neo4j" or "analysis"

# "cypher" for Neo4j import; "csv" for analysis and as the default
export_format = "cypher" if use_case == "neo4j" else "csv"

👍 Validate Graph Structure

# ✅ Good - Enable validation
from docling_graph.core.converters import GraphConverter

converter = GraphConverter(validate_graph=True)
graph, metadata = converter.pydantic_list_to_graph(models)

👍 Use Automatic Cleanup

# ✅ Good - Enable cleanup
from docling_graph.core.converters import GraphConverter

converter = GraphConverter(auto_cleanup=True)
graph, metadata = converter.pydantic_list_to_graph(models)

👍 Check Statistics

# ✅ Good - Verify graph quality
if metadata.node_count == 0:
    print("Warning: Empty graph")

if metadata.edge_count == 0:
    print("Warning: No relationships")
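
Beyond raw counts, two further checks are often useful: nodes that no edge touches, and edges that point at node IDs that do not exist. The sketch below runs on plain node and edge dicts in the shape shown under Graph Structure; the `quality_report` helper and the sample data are hypothetical and not part of docling-graph.

```python
def quality_report(nodes, edges):
    """Return IDs of isolated nodes and edges that reference unknown nodes."""
    node_ids = {node["id"] for node in nodes}
    connected = set()
    dangling = []
    for edge in edges:
        if edge["source"] in node_ids and edge["target"] in node_ids:
            connected.update((edge["source"], edge["target"]))
        else:
            # At least one endpoint is missing from the node list.
            dangling.append(edge)
    isolated = sorted(node_ids - connected)
    return isolated, dangling


nodes = [{"id": "invoice_001"}, {"id": "org_acme"}, {"id": "addr_1"}]
edges = [
    {"source": "invoice_001", "target": "org_acme", "label": "ISSUED_BY"},
    {"source": "org_acme", "target": "addr_missing", "label": "LOCATED_AT"},
]

isolated, dangling = quality_report(nodes, edges)
print(f"Isolated nodes: {isolated}")
print(f"Dangling edges: {len(dangling)}")
```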

Troubleshooting

🐛 Empty Graph

Solution:

# Check if models were extracted
if not models:
    print("No models extracted")

# Inspect each model to confirm its fields and nested entities are populated
for model in models:
    print(f"Model: {model}")

🐛 Missing Relationships

Solution:

from pydantic import BaseModel

# Ensure entities are properly defined
class Organization(BaseModel):
    name: str
    # Must be marked as an entity to become a node (and so take part in edges)
    model_config = {"is_entity": True}

🐛 Export Fails

Solution:

# Check output directory exists
import os
os.makedirs("outputs", exist_ok=True)

# Check graph is not empty
if graph.number_of_nodes() == 0:
    print("Cannot export empty graph")


Next Steps

Ready to dive deeper? Start with:

  1. Graph Conversion → - Learn graph conversion
  2. Export Formats → - Choose export format
  3. Visualization → - Visualize your graphs