Neo4j Integration¶
Overview¶
Neo4j integration lets you import knowledge graphs into a Neo4j graph database for querying, analysis, and visualization with the Cypher query language.
In this guide:
- Neo4j setup
- Cypher import
- Query examples
- Best practices
- Troubleshooting
Why Neo4j?¶
Benefits¶
✅ Graph-native database
- Optimized for graph queries
- Fast relationship traversal
- ACID transactions
✅ Cypher query language
- Intuitive pattern matching
- Powerful aggregations
- Path finding algorithms
✅ Visualization
- Built-in graph browser
- Interactive exploration
- Custom styling
✅ Scalability
- Handles millions of nodes
- Distributed architecture
- High performance
Neo4j Setup¶
Installation¶
Option 1: Neo4j Desktop (Recommended)¶
# Download from https://neo4j.com/download/
# Install and create a new database
# Default credentials: neo4j/neo4j (change on first login)
Option 2: Docker¶
# Run Neo4j in Docker
docker run \
--name neo4j \
-p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/password \
neo4j:latest
# Access at http://localhost:7474
Option 3: Cloud (Neo4j Aura)¶
# Sign up at https://neo4j.com/cloud/aura/
# Create free instance
# Note connection URI and credentials
Verify Installation¶
# Check Neo4j is running
curl http://localhost:7474
# Test cypher-shell
cypher-shell -u neo4j -p password "RETURN 1"
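If curl or cypher-shell is not available, the same reachability check can be sketched in Python with only the standard library. The function name is_port_open is hypothetical, and the ports are the ones used in this guide (7474 for HTTP, 7687 for Bolt). An open port only shows that something is listening; it does not prove that the credentials are valid.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

For example, is_port_open("localhost", 7687) should return True once the Docker container from Option 2 has finished starting.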
Exporting for Neo4j¶
Generate Cypher Script¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
export_format="cypher", # Generate Cypher script
output_dir="neo4j_import"
)
run_pipeline(config)
# Generates: neo4j_import/graph.cypher
Importing to Neo4j¶
Method 1: cypher-shell (Recommended)¶
# Import Cypher script
cat neo4j_import/graph.cypher | cypher-shell -u neo4j -p password
# Or with file
cypher-shell -u neo4j -p password -f neo4j_import/graph.cypher
# With error logging
cat neo4j_import/graph.cypher | cypher-shell -u neo4j -p password 2>&1 | tee import.log
Method 2: Neo4j Browser¶
- Open Neo4j Browser (http://localhost:7474)
- Login with credentials
- Open the graph.cypher file
- Copy contents
- Paste into query editor
- Click "Run" or press Ctrl+Enter
Method 3: Python Driver¶
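from neo4j import GraphDatabase
# Connect to Neo4j
driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    auth=("neo4j", "password")
)
# Read Cypher file
with open("neo4j_import/graph.cypher") as f:
    cypher_script = f.read()
# The driver executes one statement per run() call, so split the script first.
# Naive split: assumes statements end with ";" and no ";" appears inside string literals.
statements = [s.strip() for s in cypher_script.split(";") if s.strip()]
# Execute
with driver.session() as session:
    for statement in statements:
        session.run(statement).consume()
driver.close()
print("✅ Imported to Neo4j")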
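The Python driver executes a single Cypher statement per run() call, so a multi-statement script has to be split before it is sent. Splitting on every semicolon breaks when a property value contains one. The following is a minimal sketch of a quote-aware splitter; the name split_cypher_statements is hypothetical, and it assumes the generated script separates statements with ";" and does not handle semicolons or quotes inside comments.

```python
def split_cypher_statements(script: str) -> list[str]:
    """Split a Cypher script on ';' while ignoring semicolons inside quotes."""
    statements = []
    current = []
    quote = None     # the quote character we are currently inside, if any
    escaped = False  # True when the previous character was a backslash
    for ch in script:
        if quote is not None:
            current.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
        elif ch in ("'", '"', "`"):
            quote = ch
            current.append(ch)
        elif ch == ";":
            text = "".join(current).strip()
            if text:
                statements.append(text)
            current = []
        else:
            current.append(ch)
    # Keep a trailing statement that has no terminating semicolon
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements
```

Each returned string can then be passed to session.run() on its own.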
Method 4: Automated Import¶
from docling_graph import run_pipeline, PipelineConfig
import subprocess
# Extract and export
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
export_format="cypher",
output_dir="neo4j_import"
)
run_pipeline(config)
# Import to Neo4j
result = subprocess.run([
    "cypher-shell",
    "-u", "neo4j",
    "-p", "password",
    "-f", "neo4j_import/graph.cypher"
], capture_output=True, text=True)
if result.returncode == 0:
    print("✅ Successfully imported to Neo4j")
else:
    print(f"❌ Import failed: {result.stderr}")
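Passing the password with -p makes it visible in the process list. cypher-shell can also read credentials from the NEO4J_USERNAME and NEO4J_PASSWORD environment variables, so one possible variation of the automated import keeps the password out of the argument list. The helper names build_import_call and import_script are hypothetical; this is a sketch, not part of docling_graph.

```python
import os
import subprocess


def build_import_call(cypher_file: str, username: str, password: str):
    """Return (argv, env) for running cypher-shell without the password in argv."""
    argv = ["cypher-shell", "-f", cypher_file]
    env = dict(os.environ)  # copy so the parent environment is left untouched
    env["NEO4J_USERNAME"] = username
    env["NEO4J_PASSWORD"] = password
    return argv, env


def import_script(cypher_file: str, username: str, password: str) -> bool:
    """Run the import and report success; stderr is printed on failure."""
    argv, env = build_import_call(cypher_file, username, password)
    result = subprocess.run(argv, env=env, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Import failed: {result.stderr}")
    return result.returncode == 0
```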
Querying Neo4j¶
Basic Queries¶
Count Nodes¶
// Count all nodes
MATCH (n)
RETURN count(n) as total_nodes
// Count by type
MATCH (n)
RETURN labels(n) as type, count(n) as count
ORDER BY count DESC
Count Relationships¶
// Count all relationships
MATCH ()-[r]->()
RETURN count(r) as total_relationships
// Count by type
MATCH ()-[r]->()
RETURN type(r) as relationship_type, count(r) as count
ORDER BY count DESC
Finding Nodes¶
Find Specific Node¶
// Find invoice by number
MATCH (i:BillingDocument {document_no: "INV-001"})
RETURN i
// Find organization by name
MATCH (o:Organization {name: "Acme Corp"})
RETURN o
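When the lookup value comes from user input or another program, passing it as a query parameter avoids string concatenation and lets Neo4j reuse the query plan. A minimal sketch from Python follows; the names FIND_INVOICE and find_invoice are hypothetical, and runner stands for any object with a run(query, **params) method, such as a driver session.

```python
FIND_INVOICE = "MATCH (i:BillingDocument {document_no: $document_no}) RETURN i"


def find_invoice(runner, document_no: str):
    """Run the lookup with document_no passed as a query parameter.

    runner is any object with a run(query, **params) method,
    for example a neo4j Session or Transaction.
    """
    return runner.run(FIND_INVOICE, document_no=document_no)
```

With a real driver this would be called as find_invoice(session, "INV-001") inside a with driver.session() block.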
Find All of Type¶
// Find all invoices
MATCH (i:BillingDocument)
RETURN i
LIMIT 10
// Find all organizations
MATCH (o:Organization)
RETURN o.name, o.address
Relationship Queries¶
Direct Relationships¶
// Find who issued an invoice
MATCH (i:BillingDocument {document_no: "INV-001"})-[:ISSUED_BY]->(o:Organization)
RETURN i.document_no, o.name
// Find all line items in an invoice
MATCH (i:BillingDocument)-[:CONTAINS_LINE]->(item:LineItem)
WHERE i.document_no = "INV-001"
RETURN item.description, item.total
Multi-Hop Relationships¶
// Find invoice -> organization -> address
MATCH (i:BillingDocument)-[:ISSUED_BY]->(o:Organization)-[:LOCATED_AT]->(a:Address)
RETURN i.document_no, o.name, a.city
// Find all paths between two nodes
MATCH path = (start:BillingDocument)-[*..3]-(end:Address)
WHERE start.document_no = "INV-001"
RETURN path
Aggregation Queries¶
Sum and Average¶
// Total invoice amount
MATCH (i:BillingDocument)
RETURN sum(i.total) as total_amount
// Average invoice amount
MATCH (i:BillingDocument)
RETURN avg(i.total) as average_amount
// Count invoices per organization
MATCH (o:Organization)<-[:ISSUED_BY]-(i:BillingDocument)
RETURN o.name, count(i) as invoice_count
ORDER BY invoice_count DESC
Pattern Matching¶
Complex Patterns¶
// Find invoices with specific pattern
MATCH (i:BillingDocument)-[:ISSUED_BY]->(o:Organization),
(i)-[:SENT_TO]->(c:Organization),
(i)-[:CONTAINS_LINE]->(item:LineItem)
WHERE i.total > 1000
RETURN i, o, c, collect(item) as items
// Find organizations that both issue and receive invoices
MATCH (o:Organization)<-[:ISSUED_BY]-(i1:BillingDocument),
(o)<-[:SENT_TO]-(i2:BillingDocument)
RETURN o.name, count(DISTINCT i1) as issued, count(DISTINCT i2) as received
Complete Examples¶
📍 Import and Query¶
from docling_graph import run_pipeline, PipelineConfig
from neo4j import GraphDatabase
# 1. Extract and export
config = PipelineConfig(
source="invoices.pdf",
template="templates.BillingDocument",
export_format="cypher",
output_dir="neo4j_data"
)
run_pipeline(config)
# 2. Import to Neo4j
driver = GraphDatabase.driver(
"bolt://localhost:7687",
auth=("neo4j", "password")
)
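with open("neo4j_data/graph.cypher") as f:
    cypher_script = f.read()
# The driver executes one statement per run() call.
# Naive split: assumes no ";" inside string literals.
statements = [s.strip() for s in cypher_script.split(";") if s.strip()]
with driver.session() as session:
    for statement in statements:
        session.run(statement).consume()
# 3. Query
with driver.session() as session:
    result = session.run("""
        MATCH (i:BillingDocument)
        RETURN i.document_no, i.total
        ORDER BY i.total DESC
        LIMIT 5
    """)
    for record in result:
        print(f"{record['i.document_no']}: ${record['i.total']}")
driver.close()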
📍 Batch Import¶
from docling_graph import run_pipeline, PipelineConfig
from pathlib import Path
import subprocess
# Process multiple documents
for pdf_file in Path("documents").glob("*.pdf"):
    print(f"Processing {pdf_file.name}")
    # Extract
    config = PipelineConfig(
        source=str(pdf_file),
        template="templates.BillingDocument",
        export_format="cypher",
        output_dir=f"neo4j_batch/{pdf_file.stem}"
    )
    run_pipeline(config)
    # Import
    cypher_file = f"neo4j_batch/{pdf_file.stem}/graph.cypher"
    subprocess.run([
        "cypher-shell",
        "-u", "neo4j",
        "-p", "password",
        "-f", cypher_file
    ])
print("✅ Batch import complete")
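The loop above ignores the exit code of cypher-shell, so a failed import goes unnoticed. One way to track failures is sketched below. The function name import_all is hypothetical, the credentials are the example values used in this guide, and the run argument exists only so the loop can be exercised without a database.

```python
import subprocess
from pathlib import Path


def import_all(cypher_files, run=subprocess.run):
    """Import each file with cypher-shell and return the paths that failed."""
    failed = []
    for cypher_file in cypher_files:
        result = run(
            ["cypher-shell", "-u", "neo4j", "-p", "password", "-f", str(cypher_file)],
            capture_output=True,
            text=True,
        )
        # A non-zero exit code means cypher-shell reported an error
        if result.returncode != 0:
            failed.append(Path(cypher_file))
    return failed
```

A call such as import_all(sorted(Path("neo4j_batch").glob("*/graph.cypher"))) returns an empty list when every file was imported.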
📍 Query and Export¶
from neo4j import GraphDatabase
import pandas as pd
# Connect
driver = GraphDatabase.driver(
"bolt://localhost:7687",
auth=("neo4j", "password")
)
# Query
with driver.session() as session:
    result = session.run("""
        MATCH (i:BillingDocument)-[:ISSUED_BY]->(o:Organization)
        RETURN i.document_no as invoice,
               o.name as organization,
               i.total as amount
        ORDER BY i.total DESC
    """)
    # Convert to DataFrame while the session is still open;
    # the result cannot be read after the session closes
    df = pd.DataFrame([dict(record) for record in result])
# Export
df.to_csv("invoice_summary.csv", index=False)
print(f"Exported {len(df)} records")
driver.close()
Best Practices¶
👍 Clear Database Before Import¶
// Delete all nodes and relationships
MATCH (n)
DETACH DELETE n
// Verify empty
MATCH (n)
RETURN count(n)
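On a large graph, deleting everything in a single transaction can exhaust memory. A common workaround is to delete in limited batches until nothing is left. The sketch below assumes a session object with the run(query, **params) method of the Neo4j Python driver; the names DELETE_BATCH and clear_database are hypothetical.

```python
DELETE_BATCH = (
    "MATCH (n) WITH n LIMIT $batch_size "
    "DETACH DELETE n RETURN count(*) AS deleted"
)


def clear_database(session, batch_size: int = 10_000) -> int:
    """Delete all nodes in batches and return the total number deleted."""
    total = 0
    while True:
        record = session.run(DELETE_BATCH, batch_size=batch_size).single()
        deleted = record["deleted"]
        total += deleted
        # A batch that deletes nothing means the database is empty
        if deleted == 0:
            return total
```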
👍 Create Indexes¶
// Create index on invoice number
CREATE INDEX document_no_idx FOR (i:BillingDocument) ON (i.document_no)
// Create index on organization name
CREATE INDEX org_name_idx FOR (o:Organization) ON (o.name)
// List indexes
SHOW INDEXES
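Running CREATE INDEX a second time fails if an equivalent index already exists, which matters when the setup is scripted and re-run. Adding IF NOT EXISTS makes the statement repeatable. The following sketch builds such statements from Python; index_statement is a hypothetical helper, and it rejects anything that is not a simple identifier because labels and property names cannot be passed as query parameters.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def index_statement(name: str, label: str, prop: str) -> str:
    """Build an idempotent CREATE INDEX statement for one label and property."""
    for part in (name, label, prop):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"not a simple identifier: {part!r}")
    return f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"
```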
👍 Use Constraints¶
// Unique constraint on invoice number
CREATE CONSTRAINT invoice_unique FOR (i:BillingDocument) REQUIRE i.document_no IS UNIQUE
// Existence constraint (requires Neo4j Enterprise Edition)
CREATE CONSTRAINT document_no_exists FOR (i:BillingDocument) REQUIRE i.document_no IS NOT NULL
👍 Batch Imports¶
# ✅ Good - Import in batches
from neo4j import GraphDatabase
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# Load statements (naive split; assumes no ";" inside string literals)
with open("neo4j_import/graph.cypher") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]
# Process in batches
batch_size = 1000
for i in range(0, len(statements), batch_size):
    batch = statements[i:i+batch_size]
    with driver.session() as session:
        for statement in batch:
            session.run(statement)
    print(f"Imported batch {i//batch_size + 1}")
driver.close()
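Batching pays off most when each batch is committed as one explicit transaction, so a failure rolls back only the current batch. The sketch below assumes a session from the Neo4j Python driver (begin_transaction, run, commit, and rollback are driver methods); chunked and import_in_batches are hypothetical helper names. Schema statements such as CREATE INDEX generally cannot share a transaction with data writes, so they are best run separately.

```python
from itertools import islice


def chunked(items, size: int):
    """Yield successive lists of at most size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def import_in_batches(session, statements, batch_size: int = 1000) -> int:
    """Run each batch in its own explicit transaction; return the batch count."""
    batches = 0
    for batch in chunked(statements, batch_size):
        tx = session.begin_transaction()
        try:
            for statement in batch:
                tx.run(statement)
            tx.commit()
        except Exception:
            # Undo the partial batch, then let the caller see the error
            tx.rollback()
            raise
        batches += 1
    return batches
```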
Troubleshooting¶
🐛 Connection Refused¶
Solution:
# Check Neo4j is running
docker ps | grep neo4j
# Or check service
systemctl status neo4j
# Restart if needed
docker restart neo4j
🐛 Authentication Failed¶
Solution:
# Reset password
cypher-shell -u neo4j -p neo4j
# Then change password when prompted
# Or set the initial password in Docker (applies only when the data volume is empty)
docker run -e NEO4J_AUTH=neo4j/newpassword neo4j
🐛 Import Fails¶
Solution:
# Check Cypher syntax
head -20 neo4j_import/graph.cypher
# Test small portion
head -100 neo4j_import/graph.cypher | cypher-shell -u neo4j -p password
# Check logs
docker logs neo4j
🐛 Slow Queries¶
Solution:
// Create an index on the property used in the filter
CREATE INDEX billing_total_idx IF NOT EXISTS FOR (i:BillingDocument) ON (i.total)
// Use EXPLAIN to analyze
EXPLAIN MATCH (i:BillingDocument) WHERE i.total > 1000 RETURN i
// Use PROFILE for detailed analysis
PROFILE MATCH (i:BillingDocument) WHERE i.total > 1000 RETURN i
Advanced Topics¶
Graph Algorithms¶
// Find shortest path
MATCH path = shortestPath(
(start:BillingDocument {document_no: "INV-001"})-[*]-(end:Address)
)
RETURN path
// PageRank (requires the Graph Data Science library and a projected graph named 'myGraph')
CALL gds.pageRank.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
Full-Text Search¶
// Create full-text index
CREATE FULLTEXT INDEX organization_search FOR (o:Organization) ON EACH [o.name, o.description]
// Search
CALL db.index.fulltext.queryNodes('organization_search', 'Acme')
YIELD node, score
RETURN node.name, score
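The search string passed to db.index.fulltext.queryNodes is interpreted as a Lucene query, so characters such as ":" or "(" in user input act as operators or cause parse errors. A minimal sketch of an escaping helper follows; the name escape_lucene is hypothetical, and the character set is taken from Lucene's classic query syntax.

```python
import re

# Characters and operators with special meaning in Lucene's classic query syntax
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_lucene(text: str) -> str:
    """Backslash-escape Lucene query operators so text is matched literally."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)
```

The escaped string can then be passed as a query parameter in place of the literal 'Acme' in the search above.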
Next Steps¶
Now that you understand Neo4j integration:
- Graph Analysis → Analyze graph structure
- CLI Guide → Use command-line tools
- API Reference → Programmatic access