Knowledge Graph Management¶

Overview¶

Knowledge Graph Management covers converting extracted Pydantic models into NetworkX graphs, exporting them as CSV, Cypher, or JSON, and visualizing the results as interactive HTML.

What you'll learn:

- Graph conversion from Pydantic models
- Export formats (CSV, Cypher, JSON)
- Visualization techniques
- Graph analysis and statistics
- Neo4j integration

The Graph Pipeline¶

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "Pydantic Models" }

    B@{ shape: procs, label: "Graph Conversion" }
    C@{ shape: doc, label: "NetworkX Graph" }

    D@{ shape: tag-proc, label: "Export" }
    F@{ shape: tag-proc, label: "Visualization" }

    E1@{ shape: doc, label: "CSV Files" }
    E2@{ shape: doc, label: "Cypher Script" }
    E3@{ shape: doc, label: "JSON" }

    G@{ shape: doc, label: "Interactive HTML" }

    %% 3. Define Connections
    A --> B
    B --> C

    C --> D
    C --> F

    D --> E1
    D --> E2
    D --> E3

    F --> G

    %% 4. Apply Classes
    class A input
    class B process
    class C data
    class D,F operator
    class E1,E2,E3,G output

Key Concepts¶

1. Graph Conversion¶

Transform Pydantic models into a graph structure:

from docling_graph.core.converters import GraphConverter

# `models` is the list of Pydantic model instances produced by extraction
converter = GraphConverter()
graph, metadata = converter.pydantic_list_to_graph(models)

print(f"Nodes: {metadata.node_count}")
print(f"Edges: {metadata.edge_count}")

Learn more: Graph Conversion →
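
To make the conversion idea concrete, here is a minimal, library-independent sketch. It uses standard-library dataclasses as stand-ins for Pydantic models. The helper `to_graph`, the ID scheme, and the rule that an upper-cased field name becomes the edge label are all assumptions made for illustration, not a description of what `GraphConverter` does internally.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Organization:
    name: str


@dataclass
class BillingDocument:
    document_no: str
    total: float
    issued_by: Organization


def to_graph(root):
    """Walk nested dataclasses and collect node and edge dicts (illustrative only)."""
    nodes, edges = {}, []

    def visit(obj):
        label = type(obj).__name__
        # Hypothetical ID scheme: class name plus a per-class counter.
        seen = sum(1 for n in nodes.values() if n["label"] == label)
        node_id = f"{label}_{seen}"
        node = {"id": node_id, "label": label}
        nodes[node_id] = node
        for field in fields(obj):
            value = getattr(obj, field.name)
            if is_dataclass(value):
                # Nested model: becomes its own node, linked by an edge.
                child_id = visit(value)
                edges.append(
                    {"source": node_id, "target": child_id, "label": field.name.upper()}
                )
            else:
                # Scalar field: stored as a node property.
                node[field.name] = value
        return node_id

    visit(root)
    return list(nodes.values()), edges


doc = BillingDocument("INV-001", 1000, Organization("Acme"))
nodes, edges = to_graph(doc)
```

Here the nested `Organization` ends up as a second node, and the `issued_by` field turns into an `ISSUED_BY` edge between the two.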

2. Export Formats¶

Export graphs in multiple formats:

from docling_graph.core.exporters import CSVExporter, CypherExporter

# `graph` is the converted graph; `output_dir` and `output_file` are paths you choose

# CSV export
CSVExporter().export(graph, output_dir)

# Cypher export
CypherExporter().export(graph, output_file)

Learn more: Export Formats →

3. Visualization¶

Generate interactive visualizations:

from docling_graph.core.visualizers import InteractiveVisualizer

visualizer = InteractiveVisualizer()
visualizer.save_cytoscape_graph(graph, "graph.html")

Learn more: Visualization →

4. Neo4j Integration¶

Import graphs into Neo4j:

# Import Cypher script
cat graph.cypher | cypher-shell -u neo4j -p password

Learn more: Neo4j Integration →

Quick Start¶

Complete Pipeline¶

from docling_graph import run_pipeline, PipelineConfig

# Run complete pipeline
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local",
    export_format="csv",  # or "cypher"
    output_dir="outputs"
)

run_pipeline(config)

# Outputs:
# - outputs/nodes.csv
# - outputs/edges.csv
# - outputs/graph_stats.json
# - outputs/visualization.html

Graph Structure¶

Nodes¶

Nodes represent entities from your Pydantic models:

# Node structure
{
    "id": "invoice_001",
    "label": "BillingDocument",
    "type": "entity",
    "document_no": "INV-001",
    "total": 1000
}

Edges¶

Edges represent relationships between entities:

# Edge structure
{
    "source": "invoice_001",
    "target": "org_acme",
    "label": "ISSUED_BY"
}
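
Because edges refer to nodes only by `id`, an edge is usable only if both of its endpoints exist. The following sketch checks that on plain dictionaries shaped like the two structures above. The helper name `dangling_edges`, the second node, and the deliberately broken edge are hypothetical.

```python
def dangling_edges(nodes, edges):
    """Return the edges whose source or target is not a known node id."""
    known_ids = {node["id"] for node in nodes}
    return [
        edge
        for edge in edges
        if edge["source"] not in known_ids or edge["target"] not in known_ids
    ]


nodes = [
    {"id": "invoice_001", "label": "BillingDocument", "type": "entity"},
    {"id": "org_acme", "label": "Organization", "type": "entity"},  # assumed example node
]
edges = [
    {"source": "invoice_001", "target": "org_acme", "label": "ISSUED_BY"},
    {"source": "invoice_001", "target": "org_missing", "label": "SENT_TO"},  # broken on purpose
]

broken = dangling_edges(nodes, edges)
```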

Export Formats Comparison¶

| Format | Best For | File Type | Use Case |
|--------|----------|-----------|----------|
| CSV | Analysis, spreadsheets | `.csv` | Data analysis, Excel |
| Cypher | Neo4j import | `.cypher` | Graph database |
| JSON | APIs, processing | `.json` | Programmatic access |
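
To show how the three formats differ in shape, this sketch serializes one tiny graph using only the standard library. The column layout, the style of the Cypher statements, and the JSON keys are assumptions chosen for illustration; the files docling_graph writes may look different.

```python
import csv
import io
import json

nodes = [
    {"id": "invoice_001", "label": "BillingDocument"},
    {"id": "org_acme", "label": "Organization"},
]
edges = [{"source": "invoice_001", "target": "org_acme", "label": "ISSUED_BY"}]

# CSV: one row per node, readable by spreadsheets and pandas.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "label"], lineterminator="\n")
writer.writeheader()
writer.writerows(nodes)
nodes_csv = csv_buffer.getvalue()

# Cypher: statements a graph database executes to create the data.
# json.dumps is used here only to produce a double-quoted string literal.
cypher_lines = [f"MERGE (:{n['label']} {{id: {json.dumps(n['id'])}}});" for n in nodes]
cypher_lines += [
    f"MATCH (a {{id: {json.dumps(e['source'])}}}), (b {{id: {json.dumps(e['target'])}}}) "
    f"MERGE (a)-[:{e['label']}]->(b);"
    for e in edges
]
cypher_script = "\n".join(cypher_lines)

# JSON: one nested document, convenient for programs and APIs.
graph_json = json.dumps({"nodes": nodes, "edges": edges}, indent=2)
```

The CSV form is tabular and needs a second file for edges, the Cypher form is executable, and the JSON form keeps nodes and edges together in one document.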

Section Contents¶

1. Graph Conversion ¶

Learn how Pydantic models are converted to NetworkX graphs.

Topics:

- Node creation
- Edge generation
- Node ID registry
- Graph validation
- Automatic cleanup

2. Export Formats ¶

Understand different export formats and when to use them.

Topics:

- CSV export (nodes and edges)
- Cypher export (Neo4j)
- JSON export (programmatic)
- Format selection

3. Visualization ¶

Generate interactive visualizations of your knowledge graphs.

Topics:

- Interactive HTML graphs
- Markdown reports
- Graph statistics
- Customization options

4. Neo4j Integration ¶

Import and query graphs in a Neo4j database.

Topics:

- Cypher import
- Neo4j setup
- Query examples
- Best practices

5. Graph Analysis ¶

Analyze graph structure and statistics.

Topics:

- Node and edge counts
- Graph metrics
- Connectivity analysis
- Quality checks

6. Advanced Topics¶

Advanced graph management techniques are covered in other sections.

See also:

- Custom Exporters
- Performance Tuning
- Graph Analysis

Common Workflows¶

Workflow 1: CSV Analysis¶

from docling_graph import run_pipeline, PipelineConfig

# Extract and export to CSV
config = PipelineConfig(
    source="invoices.pdf",
    template="templates.BillingDocument",
    export_format="csv",
    output_dir="analysis"
)

run_pipeline(config)

# Analyze in Python
import pandas as pd

nodes = pd.read_csv("analysis/nodes.csv")
edges = pd.read_csv("analysis/edges.csv")

print(f"Total invoices: {len(nodes[nodes['label'] == 'BillingDocument'])}")
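
If pandas is not available, the same count can be sketched with the standard library. The sample rows and the `label` column mirror the node structure shown earlier and are assumptions about the file layout; an in-memory string stands in for `analysis/nodes.csv`.

```python
import csv
import io
from collections import Counter

# Stand-in for open("analysis/nodes.csv"); the column names are assumed.
sample_csv = io.StringIO(
    "id,label\n"
    "invoice_001,BillingDocument\n"
    "org_acme,Organization\n"
    "org_globex,Organization\n"
)

# Count how many nodes carry each label.
label_counts = Counter(row["label"] for row in csv.DictReader(sample_csv))
invoice_total = label_counts["BillingDocument"]
```

`Counter` returns 0 for labels that never occur, so looking up a missing node type does not raise an error.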

Workflow 2: Neo4j Import¶

from docling_graph import run_pipeline, PipelineConfig

# Extract and export to Cypher
config = PipelineConfig(
    source="contracts.pdf",
    template="templates.Contract",
    export_format="cypher",
    output_dir="neo4j_import"
)

run_pipeline(config)

# Then import into Neo4j from a shell:
#   cat neo4j_import/graph.cypher | cypher-shell

Workflow 3: Programmatic Access¶

from docling_graph import run_pipeline, PipelineConfig
import json

# Extract and access programmatically
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    output_dir="data"
)

run_pipeline(config)

# Load graph data
with open("data/graph_data.json") as f:
    graph_data = json.load(f)

# Process nodes
for node in graph_data["nodes"]:
    print(f"{node['type']}: {node['id']}")
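
Once loaded, the lists can be indexed for lookups. This sketch assumes the JSON also contains an `edges` key whose entries have the `source`, `target`, and `label` fields shown under Graph Structure; that key is not confirmed by the example above, and the inline dictionary stands in for the loaded file.

```python
from collections import defaultdict

# Assumed shape of data/graph_data.json.
graph_data = {
    "nodes": [
        {"id": "invoice_001", "label": "BillingDocument", "type": "entity"},
        {"id": "org_acme", "label": "Organization", "type": "entity"},
    ],
    "edges": [{"source": "invoice_001", "target": "org_acme", "label": "ISSUED_BY"}],
}

# Index outgoing edges by source id: {source: [(edge label, target), ...]}.
outgoing = defaultdict(list)
for edge in graph_data["edges"]:
    outgoing[edge["source"]].append((edge["label"], edge["target"]))

# Follow one relationship type from one node.
issuers = [target for label, target in outgoing["invoice_001"] if label == "ISSUED_BY"]
```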

Graph Statistics¶

Automatic Statistics¶

Every pipeline run writes graph statistics to graph_stats.json in the output directory:

{
  "node_count": 15,
  "edge_count": 18,
  "node_types": {
    "BillingDocument": 1,
    "Organization": 2,
    "Address": 3,
    "LineItem": 9
  },
  "edge_types": {
    "ISSUED_BY": 1,
    "SENT_TO": 1,
    "LOCATED_AT": 5,
    "CONTAINS_LINE": 9
  },
  "avg_degree": 2.4,
  "density": 0.17
}
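
The two derived metrics follow from the counts. Assuming the standard undirected definitions (average degree = 2E/N, density = 2E/(N(N-1))), the sketch below reproduces the `avg_degree` and `density` values in the example. Whether docling_graph computes them with exactly these formulas is an assumption.

```python
def average_degree(node_count: int, edge_count: int) -> float:
    """Each edge contributes to the degree of two nodes."""
    return 2 * edge_count / node_count if node_count else 0.0


def density(node_count: int, edge_count: int) -> float:
    """Fraction of possible undirected node pairs that are connected."""
    if node_count < 2:
        return 0.0
    return 2 * edge_count / (node_count * (node_count - 1))


avg = average_degree(15, 18)      # 36 / 15
dens = round(density(15, 18), 2)  # 36 / 210, rounded to two decimals
```

A density near 0 means a sparse graph, and a value of 1 means every pair of nodes is connected.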

Using Statistics¶

import json

# Load statistics
with open("outputs/graph_stats.json") as f:
    stats = json.load(f)

print(f"Graph has {stats['node_count']} nodes")
print(f"Most common node type: {max(stats['node_types'], key=stats['node_types'].get)}")

Visualization Preview¶

Interactive HTML¶

Every pipeline run generates an interactive visualization:

outputs/
└── visualization.html  # Open in browser

Features:

- Zoom and pan
- Node inspection
- Search functionality
- Export to image

Best Practices¶

👍 Choose the Right Format¶

# ✅ Good - Match format to use case
use_case = "neo4j"  # or "analysis"

if use_case == "neo4j":
    export_format = "cypher"  # Import into a graph database
else:
    export_format = "csv"  # Analysis, and the default

👍 Validate Graph Structure¶

from docling_graph.core.converters import GraphConverter

# ✅ Good - Enable validation
converter = GraphConverter(validate_graph=True)
graph, metadata = converter.pydantic_list_to_graph(models)

👍 Use Automatic Cleanup¶

from docling_graph.core.converters import GraphConverter

# ✅ Good - Enable cleanup
converter = GraphConverter(auto_cleanup=True)
graph, metadata = converter.pydantic_list_to_graph(models)

👍 Check Statistics¶

# ✅ Good - Verify graph quality
if metadata.node_count == 0:
    print("Warning: Empty graph")

if metadata.edge_count == 0:
    print("Warning: No relationships")
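
Counts alone do not reveal nodes that exist but are connected to nothing. A sketch of that check on plain node and edge lists follows. The helper `isolated_nodes` and the sample data are hypothetical; on a NetworkX graph, `list(nx.isolates(graph))` gives the equivalent result.

```python
def isolated_nodes(nodes, edges):
    """Return ids of nodes that appear in no edge, in input order."""
    connected = {e["source"] for e in edges} | {e["target"] for e in edges}
    return [n["id"] for n in nodes if n["id"] not in connected]


nodes = [{"id": "invoice_001"}, {"id": "org_acme"}, {"id": "addr_1"}]
edges = [{"source": "invoice_001", "target": "org_acme", "label": "ISSUED_BY"}]

orphans = isolated_nodes(nodes, edges)
```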

Troubleshooting¶

🐛 Empty Graph¶

Solution:

# Check if models were extracted
if not models:
    print("No models extracted")

# Inspect each model to confirm its nested entities (relationships) are populated
for model in models:
    print(f"Model: {model}")

🐛 Missing Relationships¶

Solution:

from pydantic import BaseModel

# Ensure entities are properly defined
class Organization(BaseModel):
    name: str
    # Mark the model as an entity so it produces a node
    model_config = {"is_entity": True}

🐛 Export Fails¶

Solution:

# Make sure the output directory exists (it is created if missing)
import os
os.makedirs("outputs", exist_ok=True)

# Check graph is not empty
if graph.number_of_nodes() == 0:
    print("Cannot export empty graph")

Next Steps¶

Ready to dive deeper? Start with:

- Graph Conversion →: learn how models are converted to graphs
- Export Formats →: choose an export format
- Visualization →: visualize your graphs
