Graph Conversion¶
Overview¶
Graph conversion transforms Pydantic models into NetworkX directed graphs, creating nodes for entities and edges for relationships. This is the foundation of knowledge graph creation.
In this guide:
- Conversion process
- Node and edge creation
- Node ID registry
- Graph validation
- Automatic cleanup
Conversion Process¶
High-Level Flow¶
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
A@{ shape: terminal, label: "Pydantic Models" }
B@{ shape: lin-proc, label: "Pre-register Models" }
C@{ shape: procs, label: "Create Nodes" }
D@{ shape: procs, label: "Create Edges" }
E@{ shape: tag-proc, label: "Auto Cleanup" }
F@{ shape: tag-proc, label: "Validate Graph" }
G@{ shape: tag-proc, label: "Calculate Stats" }
H@{ shape: doc, label: "NetworkX Graph" }
%% 3. Define Connections
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G
G --> H
%% 4. Apply Classes
class A input
class B,C,D process
class E,F,G operator
class H output
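The stage order in the diagram can be pictured as one small function. This is a hypothetical sketch, not docling-graph's implementation: plain dicts stand in for Pydantic models and for the NetworkX graph, and the `convert` function, the `key`/`refs` fields, and the ID format are all made up for illustration. It shows one plausible reason for pre-registering: once every record has an ID, an edge can point at a record that appears later in the input.

```python
# Hypothetical sketch of the stage order; not docling-graph's actual code.
# Plain dicts stand in for Pydantic models and for the NetworkX graph.

def convert(records):
    # Pre-register: give every record an ID before any edge is built,
    # so references to records later in the list can still be resolved.
    ids = {r["key"]: f"{r['label'].lower()}_{i}" for i, r in enumerate(records)}

    # Create nodes.
    nodes = {ids[r["key"]]: {"label": r["label"]} for r in records}

    # Create edges from each record's references.
    edges = [
        (ids[r["key"]], ids[target], label)
        for r in records
        for label, target in r.get("refs", {}).items()
        if target in ids  # stand-in for cleanup: skip dangling references
    ]

    # Validate: every edge endpoint must be a known node.
    if not all(s in nodes and t in nodes for s, t, _ in edges):
        raise ValueError("edge points at an unknown node")

    # Calculate stats.
    stats = {"node_count": len(nodes), "edge_count": len(edges)}
    return nodes, edges, stats


records = [
    {"key": "inv", "label": "BillingDocument", "refs": {"issued_by": "acme"}},
    {"key": "acme", "label": "Organization"},
]
nodes, edges, stats = convert(records)
print(edges)
print(stats)
```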
GraphConverter¶
Basic Usage¶
from docling_graph.core.converters import GraphConverter
# Create converter
converter = GraphConverter()
# Convert models to graph
graph, metadata = converter.pydantic_list_to_graph(models)
print(f"Created graph with {metadata.node_count} nodes and {metadata.edge_count} edges")
With Configuration¶
from docling_graph.core.converters import GraphConverter
converter = GraphConverter(
    add_reverse_edges=False,  # Don't create bidirectional edges
    validate_graph=True,      # Validate structure
    auto_cleanup=True         # Remove phantom nodes
)
graph, metadata = converter.pydantic_list_to_graph(models)
Node Creation¶
What Becomes a Node?¶
Entities (models with is_entity=True) become nodes:
from pydantic import BaseModel
# ✅ Becomes a node
class Organization(BaseModel):
    name: str
    model_config = {"is_entity": True}  # Default
# ❌ Does NOT become a node
class Address(BaseModel):
    street: str
    city: str
    model_config = {"is_entity": False}  # Component
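The entity check can be sketched without any dependencies. Below, plain classes stand in for Pydantic models, and `is_entity` is a hypothetical helper (not a docling-graph function) that reads the flag and falls back to the default of `True` described above. `LineItem` is included only to show the fallback.

```python
# Plain classes stand in for Pydantic models so the sketch has no dependencies.
class Organization:
    model_config = {"is_entity": True}


class Address:
    model_config = {"is_entity": False}


class LineItem:
    pass  # no model_config at all, so the default applies


def is_entity(cls) -> bool:
    """Treat a class as an entity unless its config says otherwise."""
    config = getattr(cls, "model_config", {})
    return bool(config.get("is_entity", True))


node_classes = [c.__name__ for c in (Organization, Address, LineItem) if is_entity(c)]
print(node_classes)  # ['Organization', 'LineItem']
```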
Node Structure¶
# Node in graph
{
    "id": "organization_acme_corp",
    "label": "Organization",
    "type": "entity",
    "__class__": "Organization",
    "name": "Acme Corp",
    "address": None  # Reference to nested entity
}
Edge Creation¶
Automatic Edge Generation¶
Edges are created automatically from model relationships:
from typing import List
from pydantic import BaseModel
class BillingDocument(BaseModel):
    document_no: str
    issued_by: Organization     # Creates edge: BillingDocument -> Organization
    line_items: List[LineItem]  # Creates edges: BillingDocument -> LineItem (one per item)
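One way to picture the edge discovery is a walk over the model's fields. The sketch below uses stdlib dataclasses in place of Pydantic models, and `collect_edges` and `ENTITY_TYPES` are hypothetical names, so treat it as an illustration of the idea rather than the converter's real logic. A single entity-valued field yields one edge, and a list field yields one edge per item.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Organization:
    name: str


@dataclass
class LineItem:
    description: str


@dataclass
class BillingDocument:
    document_no: str
    issued_by: Organization
    line_items: List[LineItem] = field(default_factory=list)


ENTITY_TYPES = (Organization, LineItem, BillingDocument)


def collect_edges(model):
    """Return (source type, field name, target type) for each entity-valued field."""
    edges = []
    for f in fields(model):
        value = getattr(model, f.name)
        targets = value if isinstance(value, list) else [value]
        for target in targets:
            if isinstance(target, ENTITY_TYPES):
                edges.append((type(model).__name__, f.name, type(target).__name__))
    return edges


doc = BillingDocument(
    "INV-001",
    Organization("Acme Corp"),
    [LineItem("Product A"), LineItem("Product B")],
)
edges = collect_edges(doc)
for edge in edges:
    print(edge)
```

The scalar `document_no` field produces no edge because its value is not an entity.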
Edge Structure¶
# Edge in graph
{
    "source": "invoice_001",
    "target": "organization_acme_corp",
    "label": "issued_by",
    "properties": {}
}
Custom Edge Labels¶
from pydantic import BaseModel, Field
class BillingDocument(BaseModel):
    issued_by: Organization = Field(
        json_schema_extra={"edge_label": "ISSUED_BY"}
    )
Result: the edge label becomes ISSUED_BY instead of the default, the field name issued_by.
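The label lookup amounts to "use the override if present, otherwise the field name". The sketch below shows that fallback with stdlib dataclass field metadata standing in for Pydantic's `json_schema_extra`. The `edge_label` helper and the `sent_to` field are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field, fields


@dataclass
class BillingDocument:
    # Field with an explicit label override.
    issued_by: str = field(default="", metadata={"edge_label": "ISSUED_BY"})
    # Hypothetical field with no override; its name is used as the label.
    sent_to: str = ""


def edge_label(f) -> str:
    """Prefer the explicit edge_label, fall back to the field name."""
    return f.metadata.get("edge_label", f.name)


labels = {f.name: edge_label(f) for f in fields(BillingDocument)}
print(labels)  # {'issued_by': 'ISSUED_BY', 'sent_to': 'sent_to'}
```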
Node ID Registry¶
What is Node ID Registry?¶
The NodeIDRegistry ensures consistent, deterministic node IDs across multiple extractions.
How It Works¶
from docling_graph.core.converters import NodeIDRegistry
registry = NodeIDRegistry()
# Same entity always gets same ID
org1 = Organization(name="Acme Corp")
org2 = Organization(name="Acme Corp")
# Both resolve to the same ID, e.g. "organization_acme_corp"
id1 = registry.get_node_id(org1)
id2 = registry.get_node_id(org2)
assert id1 == id2  # Passes
ID Generation¶
import hashlib
import json
from pydantic import BaseModel
def generate_node_id(model: BaseModel) -> str:
    """Generate deterministic node ID."""
    class_name = model.__class__.__name__.lower()
    # Use stable fields for identity
    stable_fields = {
        k: v for k, v in model.model_dump().items()
        if k not in {"id", "__class__"} and v is not None
    }
    # Create content hash (sort_keys makes the JSON independent of field order)
    content = json.dumps(stable_fields, sort_keys=True)
    hash_suffix = hashlib.blake2b(content.encode()).hexdigest()[:8]
    return f"{class_name}_{hash_suffix}"
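To see why this scheme is deterministic, the same hashing steps can be run on plain dicts. The `content_id` helper below is a hypothetical, dependency-free variant of the function above (it takes a class name and a dict instead of a Pydantic model), and the `vat` field is made up. It shows that field order and `None` values do not change the ID, while different content does.

```python
import hashlib
import json


def content_id(class_name: str, field_values: dict) -> str:
    """Hash the non-None fields in a key-order-independent way."""
    stable = {k: v for k, v in field_values.items() if v is not None}
    content = json.dumps(stable, sort_keys=True)
    digest = hashlib.blake2b(content.encode()).hexdigest()[:8]
    return f"{class_name.lower()}_{digest}"


a = content_id("Organization", {"name": "Acme Corp", "vat": None})
b = content_id("Organization", {"vat": None, "name": "Acme Corp"})
c = content_id("Organization", {"name": "Acme Inc"})

print(a == b)  # True: same content, different key order
print(a == c)  # False: different content
```

Note that `json.dumps` only accepts JSON-serializable values, so fields such as dates would need converting to strings first.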
Graph Validation¶
Automatic Validation¶
Validation checks graph structure:
converter = GraphConverter(validate_graph=True)
graph, metadata = converter.pydantic_list_to_graph(models)
# Validates:
# - No isolated nodes
# - Valid node IDs
# - Valid edge connections
# - No self-loops (optional)
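The checks listed above can be sketched over a plain node set and edge list. This is an assumption-laden illustration, not the library's validator: `find_problems` and its `allow_self_loops` flag are hypothetical names, and "valid node ID" is reduced here to "non-empty".

```python
def find_problems(nodes, edges, allow_self_loops=False):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    connected = set()
    for source, target in edges:
        # Valid edge connections: both endpoints must exist.
        if source not in nodes or target not in nodes:
            problems.append(f"edge {source}->{target} references a missing node")
            continue
        # Self-loops are reported unless explicitly allowed.
        if source == target and not allow_self_loops:
            problems.append(f"self-loop on {source}")
        connected.update((source, target))
    for node_id in sorted(nodes):
        if not node_id:
            problems.append("empty node ID")
        elif node_id not in connected:
            problems.append(f"isolated node {node_id}")
    return problems


nodes = {"doc_1", "org_1", "org_2"}
edges = [("doc_1", "org_1"), ("doc_1", "ghost"), ("org_1", "org_1")]
problems = find_problems(nodes, edges)
for problem in problems:
    print(problem)
```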
Manual Validation¶
from docling_graph.core.utils import validate_graph_structure
try:
    validate_graph_structure(graph, raise_on_error=True)
    print("✅ Graph structure valid")
except ValueError as e:
    print(f"❌ Validation failed: {e}")
Automatic Cleanup¶
What Gets Cleaned?¶
Automatic cleanup removes:
- Phantom nodes - Nodes with no data
- Duplicate nodes - Same entity multiple times
- Orphaned edges - Edges to non-existent nodes
- Empty attributes - Null or empty values
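The cleanup rules above can be illustrated on a dict-based graph. The sketch is hypothetical (the `clean` function and `EMPTY` tuple are not part of docling-graph) and assumes a phantom node is one with nothing left but its label once empty attributes are stripped. Duplicate nodes need no extra step here because a dict keyed by node ID keeps one entry per ID.

```python
EMPTY = (None, "", [], {})


def clean(nodes, edges):
    """nodes: {node_id: attrs}; edges: [(source, target, label)]."""
    cleaned_nodes = {}
    for node_id, attrs in nodes.items():
        # Empty attributes: drop None and empty values.
        kept = {k: v for k, v in attrs.items() if v not in EMPTY}
        # Phantom nodes: skip nodes with no data besides the label.
        if set(kept) - {"label"}:
            cleaned_nodes[node_id] = kept
    # Orphaned edges: keep only edges whose endpoints both survived.
    cleaned_edges = [
        (s, t, label)
        for s, t, label in edges
        if s in cleaned_nodes and t in cleaned_nodes
    ]
    return cleaned_nodes, cleaned_edges


nodes = {
    "doc_1": {"label": "BillingDocument", "document_no": "INV-001", "notes": ""},
    "org_1": {"label": "Organization", "name": "Acme Corp", "phone": None},
    "org_2": {"label": "Organization", "name": None},
}
edges = [("doc_1", "org_1", "issued_by"), ("doc_1", "org_2", "paid_by")]

cleaned_nodes, cleaned_edges = clean(nodes, edges)
print(cleaned_nodes)
print(cleaned_edges)
```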
Configuration¶
converter = GraphConverter(
    auto_cleanup=True  # Enable cleanup (default)
)
graph, metadata = converter.pydantic_list_to_graph(models)
Manual Cleanup¶
from docling_graph.core.utils import GraphCleaner
cleaner = GraphCleaner(verbose=True)
cleaned_graph = cleaner.clean_graph(graph)
print(f"Removed {graph.number_of_nodes() - cleaned_graph.number_of_nodes()} nodes")
Complete Examples¶
📍 Basic Conversion¶
from docling_graph.core.converters import GraphConverter
from my_templates import BillingDocument, Organization, LineItem
# Create sample models
models = [
    BillingDocument(
        document_no="INV-001",
        issued_by=Organization(name="Acme Corp"),
        line_items=[
            LineItem(description="Product A", total=100),
            LineItem(description="Product B", total=200)
        ],
        total=300
    )
]
# Convert to graph
converter = GraphConverter()
graph, metadata = converter.pydantic_list_to_graph(models)
print(f"Nodes: {metadata.node_count}")
print(f"Edges: {metadata.edge_count}")
print(f"Node types: {metadata.node_types}")
📍 With Reverse Edges¶
from docling_graph.core.converters import GraphConverter
# Create bidirectional edges
converter = GraphConverter(add_reverse_edges=True)
graph, metadata = converter.pydantic_list_to_graph(models)
# Original edge: BillingDocument -> Organization (ISSUED_BY)
# Reverse edge: Organization -> BillingDocument (reverse_ISSUED_BY)
print(f"Total edges (with reverse): {metadata.edge_count}")
📍 Shared Registry for Batches¶
from docling_graph.core.converters import GraphConverter, NodeIDRegistry
# Create shared registry
registry = NodeIDRegistry()
# Convert first batch
converter1 = GraphConverter(registry=registry)
graph1, _ = converter1.pydantic_list_to_graph(batch1_models)
# Convert second batch (same registry)
converter2 = GraphConverter(registry=registry)
graph2, _ = converter2.pydantic_list_to_graph(batch2_models)
# Same entities get same IDs across batches
print(f"Registry has {registry.get_stats()['total_entities']} unique entities")
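Shared IDs matter most when batch graphs are merged: nodes with the same ID collapse into one. The sketch below uses a hypothetical `TinyRegistry` as a stand-in for `NodeIDRegistry`, with an assumed slug-based ID format and plain dicts in place of NetworkX graphs, so it only illustrates the idea.

```python
class TinyRegistry:
    """Hypothetical stand-in for NodeIDRegistry: one ID per (type, key) pair."""

    def __init__(self):
        self._ids = {}

    def get_id(self, entity_type: str, key: str) -> str:
        slug = key.lower().replace(" ", "_")
        return self._ids.setdefault((entity_type, key), f"{entity_type.lower()}_{slug}")

    def total(self) -> int:
        return len(self._ids)


registry = TinyRegistry()
batch1 = [("Organization", "Acme Corp"), ("Organization", "Globex")]
batch2 = [("Organization", "Acme Corp"), ("Organization", "Initech")]

# Each "graph" is just {node_id: attributes}.
graph1 = {registry.get_id(t, k): {"name": k} for t, k in batch1}
graph2 = {registry.get_id(t, k): {"name": k} for t, k in batch2}

# Merging by ID deduplicates "Acme Corp", which appears in both batches.
merged = {**graph1, **graph2}
print(sorted(merged))
print(registry.total())  # 3
```

With real NetworkX graphs, `networkx.compose(graph1, graph2)` performs a comparable union keyed on node IDs.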
📍 Custom Configuration¶
from docling_graph.core.converters import GraphConverter, GraphConfig
# Create custom config
config = GraphConfig(
    add_reverse_edges=True,
    validate_graph=True,
    node_id_prefix="doc_"
)
converter = GraphConverter(config=config)
graph, metadata = converter.pydantic_list_to_graph(models)
Graph Metadata¶
Metadata Structure¶
from dataclasses import dataclass
from typing import Dict
@dataclass
class GraphMetadata:
    node_count: int
    edge_count: int
    node_types: Dict[str, int]
    edge_types: Dict[str, int]
    avg_degree: float
    density: float
    source_model_count: int
Using Metadata¶
graph, metadata = converter.pydantic_list_to_graph(models)
print("Graph Statistics:")
print(f" Nodes: {metadata.node_count}")
print(f" Edges: {metadata.edge_count}")
print(f" Density: {metadata.density:.2f}")
print(f" Avg degree: {metadata.avg_degree:.2f}")
print("\nNode Types:")
for node_type, count in metadata.node_types.items():
    print(f" {node_type}: {count}")
print("\nEdge Types:")
for edge_type, count in metadata.edge_types.items():
    print(f" {edge_type}: {count}")
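For intuition about the numbers, here is how such statistics could be derived from a dict-based graph. The guide does not state which formulas docling-graph uses, so this sketch assumes the usual directed-graph definitions: density is edges divided by n(n-1), and average degree counts both incoming and outgoing edges (2m/n). The `summarize` function is hypothetical.

```python
from collections import Counter


def summarize(nodes, edges):
    """nodes: {node_id: {"label": ...}}; edges: [(source, target, label)]."""
    n, m = len(nodes), len(edges)
    return {
        "node_count": n,
        "edge_count": m,
        "node_types": dict(Counter(attrs["label"] for attrs in nodes.values())),
        "edge_types": dict(Counter(label for _, _, label in edges)),
        # Each edge adds one out-degree and one in-degree.
        "avg_degree": (2 * m / n) if n else 0.0,
        # A directed graph has at most n * (n - 1) edges without self-loops.
        "density": (m / (n * (n - 1))) if n > 1 else 0.0,
    }


nodes = {
    "doc_1": {"label": "BillingDocument"},
    "org_1": {"label": "Organization"},
    "item_1": {"label": "LineItem"},
    "item_2": {"label": "LineItem"},
}
edges = [
    ("doc_1", "org_1", "issued_by"),
    ("doc_1", "item_1", "line_items"),
    ("doc_1", "item_2", "line_items"),
]
stats = summarize(nodes, edges)
print(stats["avg_degree"], stats["density"])  # 1.5 0.25
```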
Advanced Features¶
Reverse Edges¶
Create bidirectional relationships:
converter = GraphConverter(add_reverse_edges=True)
graph, metadata = converter.pydantic_list_to_graph(models)
# For each edge A -> B, creates B -> A
# Useful for graph traversal in both directions
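The effect of reverse edges can be shown on a plain edge list. The `with_reverse_edges` function is a hypothetical sketch, not the converter's code. The `reverse_` label prefix follows the `reverse_ISSUED_BY` example earlier in this guide.

```python
def with_reverse_edges(edges, prefix="reverse_"):
    """For each (source, target, label), also add (target, source, prefix + label)."""
    reverse = [(t, s, f"{prefix}{label}") for s, t, label in edges]
    return edges + reverse


edges = [("doc_1", "org_1", "ISSUED_BY")]
all_edges = with_reverse_edges(edges)

# The organization now has an outgoing edge back to the document.
outgoing_from_org = [(t, label) for s, t, label in all_edges if s == "org_1"]
print(all_edges)
print(outgoing_from_org)  # [('doc_1', 'reverse_ISSUED_BY')]
```

The trade-off is that the edge count doubles, which is worth keeping in mind for large graphs.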
Custom Node IDs¶
Provide custom node ID logic:
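import hashlib
from pydantic import BaseModel
from docling_graph.core.converters import GraphConverter, NodeIDRegistry
class CustomRegistry(NodeIDRegistry):
    def generate_node_id(self, model: BaseModel) -> str:
        # Custom ID generation: hash the serialized content rather than calling
        # hash(), which is salted per process and fails on non-frozen models
        digest = hashlib.blake2b(model.model_dump_json().encode()).hexdigest()[:8]
        return f"custom_{model.__class__.__name__}_{digest}"
registry = CustomRegistry()
converter = GraphConverter(registry=registry)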
Performance Optimization¶
Batch Processing¶
# Process large model lists efficiently
converter = GraphConverter(auto_cleanup=True)
# Convert in single call (efficient)
graph, metadata = converter.pydantic_list_to_graph(all_models)
# Don't convert one by one (inefficient)
# for model in models:
#     graph, _ = converter.pydantic_list_to_graph([model])
Memory Management¶
# For very large graphs
converter = GraphConverter(
    auto_cleanup=True,    # Remove unnecessary nodes
    validate_graph=False  # Skip validation for speed
)
graph, metadata = converter.pydantic_list_to_graph(models)
# Clear registry after conversion
converter.registry.clear()
Troubleshooting¶
🐛 Empty Graph¶
Solution:
# Check if models have entities
for model in models:
    if hasattr(model, 'model_config'):
        is_entity = model.model_config.get('is_entity', True)
        print(f"{model.__class__.__name__}: is_entity={is_entity}")
🐛 Missing Edges¶
Solution:
# Ensure relationships are defined
class BillingDocument(BaseModel):
    issued_by: Organization  # Must be typed as entity
    # Not: issued_by: dict   # Won't create edge
🐛 Duplicate Nodes¶
Solution:
# Enable auto cleanup
converter = GraphConverter(auto_cleanup=True)
graph, metadata = converter.pydantic_list_to_graph(models)
🐛 Validation Fails¶
Solution:
# Check graph structure
print(f"Nodes: {graph.number_of_nodes()}")
print(f"Edges: {graph.number_of_edges()}")
# Inspect nodes
for node_id, data in list(graph.nodes(data=True))[:5]:
    print(f"Node: {node_id}, Data: {data}")
Best Practices¶
👍 Use Shared Registry for Batches¶
# ✅ Good - Consistent IDs across batches
registry = NodeIDRegistry()
for batch in batches:
    converter = GraphConverter(registry=registry)
    graph, _ = converter.pydantic_list_to_graph(batch)
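👍 Enable Auto Cleanup¶
# ✅ Good - Removes phantom nodes and orphaned edges
converter = GraphConverter(auto_cleanup=True)
👍 Validate in Development¶
# ✅ Good - Catches structural problems early
converter = GraphConverter(validate_graph=True)
👍 Disable Validation in Production¶
# ✅ Good - Skips validation for speed
converter = GraphConverter(validate_graph=False)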
Next Steps¶
Now that you understand graph conversion:
- Export Formats: export graphs to CSV, Cypher, JSON
- Visualization: visualize your graphs
- Neo4j Integration: import into Neo4j