Export Formats¶
Overview¶
Export formats determine how your knowledge graph is saved and shared. Docling Graph supports CSV, Cypher, and JSON formats, each optimized for different use cases.
In this guide:
- CSV format (spreadsheets, analysis)
- Cypher format (Neo4j import)
- JSON format (programmatic access)
- Format selection criteria
- Integration examples
Format Comparison¶
| Format | Best For | Output | Use Case |
|---|---|---|---|
| CSV | Analysis, spreadsheets | nodes.csv, edges.csv | Excel, Pandas, SQL |
| Cypher | Graph databases | graph.cypher | Neo4j import |
| JSON | APIs, processing | graph.json | Python, JavaScript |
CSV Export¶
What is CSV Export?¶
CSV export creates separate files for nodes and edges in comma-separated format, perfect for spreadsheet analysis and SQL databases.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
export_format="csv", # CSV export (default)
output_dir="outputs"
)
run_pipeline(config)
Output Files¶
outputs/
├── nodes.csv # All nodes with properties
├── edges.csv # All edges with relationships
├── graph_stats.json # Graph statistics
└── visualization.html # Interactive visualization
nodes.csv Format¶
id,label,type,__class__,invoice_number,total,name,street,city
invoice_001,Invoice,entity,Invoice,INV-001,1000,,,
org_acme,Organization,entity,Organization,,,Acme Corp,,
addr_123,Address,entity,Address,,,,123 Main St,Paris
Columns:
- id: Unique node identifier
- label: Node type/class
- type: Always "entity"
- __class__: Python class name
- Additional columns for each property
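Because every property becomes its own column, most cells are empty for nodes of other types. A minimal Pandas sketch (file paths follow the default output layout shown above) that isolates one node type and drops its unused columns:

```python
import pandas as pd

# Load the exported nodes (path assumes the default output layout shown above)
nodes = pd.read_csv("outputs/nodes.csv")

# Keep only Invoice nodes, then drop property columns that are empty for this type
invoices = nodes[nodes["label"] == "Invoice"].dropna(axis=1, how="all")
print(invoices.head())
```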
edges.csv Format¶
source,target,label
invoice_001,org_acme,issued_by
org_acme,addr_123,located_at
invoice_001,item_001,contains_item
Columns:
- source: Source node ID
- target: Target node ID
- label: Relationship type
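With both files loaded, the graph can be rebuilt in memory for analysis. A sketch using NetworkX (not required by Docling Graph; shown here only as one way to work with the exported CSVs):

```python
import pandas as pd
import networkx as nx

nodes = pd.read_csv("outputs/nodes.csv")
edges = pd.read_csv("outputs/edges.csv")

# Build a directed graph from the edge list, keeping the relationship label as an edge attribute
G = nx.from_pandas_edgelist(
    edges, source="source", target="target", edge_attr="label", create_using=nx.DiGraph
)

# Attach node labels and properties as node attributes
nx.set_node_attributes(G, nodes.set_index("id").to_dict(orient="index"))

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```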
Manual CSV Export¶
from docling_graph.core.exporters import CSVExporter
from docling_graph.core.converters import GraphConverter
# Convert models to graph
converter = GraphConverter()
graph, metadata = converter.pydantic_list_to_graph(models)
# Export to CSV
exporter = CSVExporter()
exporter.export(graph, output_dir="csv_output")
print("Exported to csv_output/nodes.csv and csv_output/edges.csv")
Using CSV with Pandas¶
import pandas as pd
# Load CSV files
nodes = pd.read_csv("outputs/nodes.csv")
edges = pd.read_csv("outputs/edges.csv")
# Analyze nodes
print(f"Total nodes: {len(nodes)}")
print(f"Node types:\n{nodes['label'].value_counts()}")
# Analyze edges
print(f"Total edges: {len(edges)}")
print(f"Edge types:\n{edges['label'].value_counts()}")
# Filter specific node type
invoices = nodes[nodes['label'] == 'Invoice']
print(f"Found {len(invoices)} invoices")
Using CSV with SQL¶
import sqlite3
import pandas as pd
# Load CSV
nodes = pd.read_csv("outputs/nodes.csv")
edges = pd.read_csv("outputs/edges.csv")
# Create database
conn = sqlite3.connect("graph.db")
# Import to SQL
nodes.to_sql("nodes", conn, if_exists="replace", index=False)
edges.to_sql("edges", conn, if_exists="replace", index=False)
# Query
result = pd.read_sql("""
SELECT n.label, COUNT(*) as count
FROM nodes n
GROUP BY n.label
""", conn)
print(result)
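A natural follow-up is joining each edge to the labels of its endpoints, which summarises how entity types are connected. A sketch reusing the graph.db database created above:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("graph.db")  # database created in the example above

# Join each edge to the labels of its source and target nodes
result = pd.read_sql("""
    SELECT s.label AS source_type,
           e.label AS relationship,
           t.label AS target_type,
           COUNT(*) AS pair_count
    FROM edges e
    JOIN nodes s ON s.id = e.source
    JOIN nodes t ON t.id = e.target
    GROUP BY s.label, e.label, t.label
    ORDER BY pair_count DESC
""", conn)
print(result)
```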
Cypher Export¶
What is Cypher Export?¶
Cypher export generates Cypher statements for direct import into Neo4j graph databases.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
export_format="cypher", # Cypher export
output_dir="outputs"
)
run_pipeline(config)
Output Files¶
outputs/
├── graph.cypher # Cypher statements
├── graph_stats.json # Graph statistics
└── visualization.html # Interactive visualization
graph.cypher Format¶
// Cypher script generated by docling-graph
// Import this into Neo4j
// --- Create Nodes ---
CREATE (invoice_001:Invoice {invoice_number: "INV-001", total: 1000, node_id: "invoice_001"})
CREATE (org_acme:Organization {name: "Acme Corp", node_id: "org_acme"})
CREATE (addr_123:Address {street: "123 Main St", city: "Paris", node_id: "addr_123"})
// --- Create Relationships ---
MATCH (invoice_001), (org_acme)
CREATE (invoice_001)-[:ISSUED_BY]->(org_acme)
MATCH (org_acme), (addr_123)
CREATE (org_acme)-[:LOCATED_AT]->(addr_123)
Manual Cypher Export¶
from docling_graph.core.exporters import CypherExporter
from docling_graph.core.converters import GraphConverter
from pathlib import Path
# Convert models to graph
converter = GraphConverter()
graph, metadata = converter.pydantic_list_to_graph(models)
# Export to Cypher
exporter = CypherExporter()
exporter.export(graph, Path("outputs/graph.cypher"))
print("Exported to outputs/graph.cypher")
Importing to Neo4j¶
Method 1: cypher-shell¶
# Import using cypher-shell
cat outputs/graph.cypher | cypher-shell -u neo4j -p password
# Or with file
cypher-shell -u neo4j -p password -f outputs/graph.cypher
Method 2: Neo4j Browser¶
- Open Neo4j Browser (http://localhost:7474)
- Copy the contents of graph.cypher
- Paste into the query editor
- Execute
Method 3: Python Driver¶
from neo4j import GraphDatabase
# Connect to Neo4j
driver = GraphDatabase.driver(
"bolt://localhost:7687",
auth=("neo4j", "password")
)
# Read Cypher file
with open("outputs/graph.cypher") as f:
cypher_script = f.read()
# Execute
with driver.session() as session:
session.run(cypher_script)
driver.close()
print("Imported to Neo4j")
JSON Export¶
What is JSON Export?¶
JSON export is automatically generated alongside CSV or Cypher, providing structured data for programmatic access.
Output Files¶
outputs/
├── extracted_data.json # Pydantic models
├── graph_data.json # Graph structure
├── graph_stats.json # Statistics
└── ...
extracted_data.json Format¶
{
"models": [
{
"invoice_number": "INV-001",
"total": 1000,
"issued_by": {
"name": "Acme Corp",
"located_at": {
"street": "123 Main St",
"city": "Paris"
}
}
}
]
}
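Because extracted_data.json mirrors the nested Pydantic models, nested fields are reached with plain dictionary access. A short sketch assuming the structure shown above:

```python
import json

with open("outputs/extracted_data.json") as f:
    data = json.load(f)

# Walk the nested model structure shown above
for model in data["models"]:
    issuer = model.get("issued_by", {})
    address = issuer.get("located_at", {})
    print(model["invoice_number"], "-", issuer.get("name"), "-", address.get("city"))
```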
graph_data.json Format¶
{
"nodes": [
{
"id": "invoice_001",
"label": "Invoice",
"type": "entity",
"properties": {
"invoice_number": "INV-001",
"total": 1000
}
},
{
"id": "org_acme",
"label": "Organization",
"type": "entity",
"properties": {
"name": "Acme Corp"
}
}
],
"edges": [
{
"source": "invoice_001",
"target": "org_acme",
"label": "issued_by"
}
]
}
Manual JSON Export¶
from docling_graph.core.exporters import JSONExporter
from docling_graph.core.converters import GraphConverter
from pathlib import Path
# Convert models to graph
converter = GraphConverter()
graph, metadata = converter.pydantic_list_to_graph(models)
# Export to JSON
exporter = JSONExporter()
exporter.export(graph, Path("outputs/graph.json"))
print("Exported to outputs/graph.json")
Using JSON in Python¶
import json
# Load graph data
with open("outputs/graph_data.json") as f:
graph_data = json.load(f)
# Access nodes
for node in graph_data["nodes"]:
print(f"{node['label']}: {node['id']}")
# Access edges
for edge in graph_data["edges"]:
print(f"{edge['source']} --[{edge['label']}]--> {edge['target']}")
# Filter by type
invoices = [n for n in graph_data["nodes"] if n["label"] == "Invoice"]
print(f"Found {len(invoices)} invoices")
Format Selection¶
Decision Matrix¶
| Use Case | Recommended Format | Reason |
|---|---|---|
| Excel analysis | CSV | Direct import to Excel |
| Neo4j database | Cypher | Direct import |
| Python processing | JSON | Easy to parse |
| SQL database | CSV | Standard import |
| Data science | CSV | Pandas compatible |
| API integration | JSON | Standard format |
| Graph queries | Cypher | Neo4j native |
By Tool¶
| Tool | Format | Import Method |
|---|---|---|
| Excel | CSV | File → Open |
| Neo4j | Cypher | cypher-shell |
| Python | JSON | json.load() |
| Pandas | CSV | pd.read_csv() |
| SQL | CSV | COPY/LOAD DATA |
| Power BI | CSV | Get Data |
| Tableau | CSV | Connect to File |
Complete Examples¶
📍 CSV for Analysis¶
from docling_graph import run_pipeline, PipelineConfig
import pandas as pd
# Extract and export to CSV
config = PipelineConfig(
source="invoices.pdf",
template="templates.BillingDocument",
export_format="csv",
output_dir="analysis"
)
run_pipeline(config)
# Analyze with Pandas
nodes = pd.read_csv("analysis/nodes.csv")
edges = pd.read_csv("analysis/edges.csv")
# Calculate statistics
print(f"Total invoices: {len(nodes[nodes['label'] == 'Invoice'])}")
print(f"Total organizations: {len(nodes[nodes['label'] == 'Organization'])}")
print(f"Total relationships: {len(edges)}")
# Export summary
summary = nodes.groupby('label').size()
summary.to_csv("analysis/summary.csv")
📍 Cypher for Neo4j¶
from docling_graph import run_pipeline, PipelineConfig
import subprocess
# Extract and export to Cypher
config = PipelineConfig(
source="contracts.pdf",
template="templates.Contract",
export_format="cypher",
output_dir="neo4j_import"
)
run_pipeline(config)
# Import to Neo4j
result = subprocess.run([
"cypher-shell",
"-u", "neo4j",
"-p", "password",
"-f", "neo4j_import/graph.cypher"
], capture_output=True, text=True)
if result.returncode == 0:
print("✅ Successfully imported to Neo4j")
else:
print(f"❌ Import failed: {result.stderr}")
📍 JSON for API¶
from docling_graph import run_pipeline, PipelineConfig
import json
import requests
# Extract and export
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
export_format="csv", # JSON always generated
output_dir="api_data"
)
run_pipeline(config)
# Load JSON
with open("api_data/extracted_data.json") as f:
data = json.load(f)
# Send to API
response = requests.post(
"https://api.example.com/invoices",
json=data,
headers={"Content-Type": "application/json"}
)
print(f"API response: {response.status_code}")
Best Practices¶
👍 Choose Format by Use Case¶
# ✅ Good - Match format to use case
if use_case == "neo4j":
export_format = "cypher"
elif use_case == "analysis":
export_format = "csv"
else:
export_format = "csv" # Default
👍 Organize Output Directories¶
# ✅ Good - Structured outputs
from datetime import datetime
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_dir = f"exports/{export_format}/{timestamp}"
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
export_format=export_format,
output_dir=output_dir
)
👍 Validate Exports¶
# ✅ Good - Check exports exist
import os
run_pipeline(config)
if export_format == "csv":
assert os.path.exists(f"{output_dir}/nodes.csv")
assert os.path.exists(f"{output_dir}/edges.csv")
elif export_format == "cypher":
assert os.path.exists(f"{output_dir}/graph.cypher")
print("✅ Exports validated")
Troubleshooting¶
🐛 Empty CSV Files¶
Solution:
# Check if graph has nodes
import json
with open("outputs/graph_stats.json") as f:
stats = json.load(f)
if stats["node_count"] == 0:
print("No nodes in graph - check extraction")
🐛 Cypher Import Fails¶
Solution:
# Check Cypher syntax
head -20 outputs/graph.cypher
# Test connection
cypher-shell -u neo4j -p password "RETURN 1"
# Import with error logging
cat outputs/graph.cypher | cypher-shell -u neo4j -p password 2>&1 | tee import.log
🐛 JSON Parsing Error¶
Solution:
# Validate JSON
import json
try:
with open("outputs/graph_data.json") as f:
data = json.load(f)
print("✅ Valid JSON")
except json.JSONDecodeError as e:
print(f"❌ Invalid JSON: {e}")
Next Steps¶
Now that you understand export formats:
- Visualization → Visualize your graphs
- Neo4j Integration → Deep dive into Neo4j
- Graph Analysis → Analyze graph structure