Neo4j Integration¶

Overview¶

Neo4j integration lets you import knowledge graphs into a Neo4j graph database, where you can query, analyze, and visualize them using the Cypher query language.

In this guide:

- Neo4j setup
- Cypher import
- Query examples
- Best practices
- Troubleshooting

Why Neo4j?¶

Benefits¶

✅ Graph-native database

- Optimized for graph queries
- Fast relationship traversal
- ACID transactions

✅ Cypher query language

- Intuitive pattern matching
- Powerful aggregations
- Path finding algorithms

✅ Visualization

- Built-in graph browser
- Interactive exploration
- Custom styling

✅ Scalability

- Handles millions of nodes
- Distributed architecture
- High performance

Neo4j Setup¶

Installation¶

Option 1: Neo4j Desktop (Recommended)¶

# Download from https://neo4j.com/download/
# Install and create a new database
# Default credentials: neo4j/neo4j (change on first login)

Option 2: Docker¶

# Run Neo4j in Docker
docker run \
    --name neo4j \
    -p 7474:7474 -p 7687:7687 \
    -e NEO4J_AUTH=neo4j/password \
    neo4j:latest

# Access at http://localhost:7474

Option 3: Cloud (Neo4j Aura)¶

# Sign up at https://neo4j.com/cloud/aura/
# Create free instance
# Note connection URI and credentials
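
The connection URI and credentials differ between a local instance (`bolt://localhost:7687`) and Aura, whose URIs typically use the `neo4j+s://` scheme. A minimal sketch of keeping them out of the source code follows. The environment variable names `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `load_settings` are assumptions of this example, not something the Python driver reads by itself.

```python
import os
from typing import Mapping, NamedTuple, Optional


class Neo4jSettings(NamedTuple):
    uri: str
    user: str
    password: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Neo4jSettings:
    """Read connection settings from environment variables.

    The variable names are a convention of this sketch (hypothetical);
    adjust them to whatever your deployment uses.
    """
    env = os.environ if env is None else env
    password = env.get("NEO4J_PASSWORD")
    if not password:
        raise ValueError("NEO4J_PASSWORD is not set")
    return Neo4jSettings(
        uri=env.get("NEO4J_URI", "bolt://localhost:7687"),  # local default
        user=env.get("NEO4J_USER", "neo4j"),
        password=password,
    )
```

The result can then be passed on as `GraphDatabase.driver(settings.uri, auth=(settings.user, settings.password))`.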

Verify Installation¶

# Check Neo4j is running
curl http://localhost:7474

# Test cypher-shell
cypher-shell -u neo4j -p password "RETURN 1"
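
The same reachability check can be done from Python before a script tries to import anything. This is a small sketch using only the standard library; it confirms that something is listening on the port (7687 for Bolt, 7474 for HTTP), not that the credentials are valid. The helper name `is_port_open` is made up for this example.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution errors.
        return False
```

For example, `is_port_open("localhost", 7687)` should return `True` once the Docker container above has finished starting.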

Exporting for Neo4j¶

Generate Cypher Script¶

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    export_format="cypher",  # Generate Cypher script
    output_dir="neo4j_import"
)

run_pipeline(config)

# Generates: neo4j_import/graph.cypher

Importing to Neo4j¶

Method 1: cypher-shell (Recommended)¶

# Import Cypher script
cat neo4j_import/graph.cypher | cypher-shell -u neo4j -p password

# Or with file
cypher-shell -u neo4j -p password -f neo4j_import/graph.cypher

# With error logging
cat neo4j_import/graph.cypher | cypher-shell -u neo4j -p password 2>&1 | tee import.log

Method 2: Neo4j Browser¶

1. Open Neo4j Browser (http://localhost:7474)
2. Log in with your credentials
3. Open the graph.cypher file in a text editor
4. Copy its contents
5. Paste them into the query editor
6. Click "Run" or press Ctrl+Enter

Method 3: Python Driver¶

from neo4j import GraphDatabase

# Connect to Neo4j
driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    auth=("neo4j", "password")
)

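# Read the Cypher file and split it into individual statements.
# The driver executes one statement per run() call.
# NOTE: naive split; it breaks if a string literal contains ";".
with open("neo4j_import/graph.cypher", encoding="utf-8") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

# Execute each statement in turn
with driver.session() as session:
    for statement in statements:
        session.run(statement)

driver.close()
print(f"✅ Imported {len(statements)} statements to Neo4j")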
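
The Python driver runs one Cypher statement per `run()` call, so a multi-statement script generally has to be split before it is executed. Splitting on every `;` is fragile when property values contain semicolons. The sketch below is one way to split more carefully; the function name `split_cypher_statements` is made up for this example, and it handles quoted strings and `//` line comments only (not `/* ... */` block comments).

```python
from typing import List, Optional


def split_cypher_statements(script: str) -> List[str]:
    """Split a Cypher script on semicolons outside quotes and // comments."""
    statements: List[str] = []
    current: List[str] = []
    quote: Optional[str] = None  # the quote character we are inside, if any
    i = 0
    while i < len(script):
        ch = script[i]
        if quote:
            current.append(ch)
            if ch == "\\" and quote != "`" and i + 1 < len(script):
                # Keep an escaped character (e.g. \') inside a string literal.
                current.append(script[i + 1])
                i += 1
            elif ch == quote:
                quote = None
        elif ch in ("'", '"', "`"):
            quote = ch
            current.append(ch)
        elif script.startswith("//", i):
            # Drop a line comment up to (not including) the newline.
            newline = script.find("\n", i)
            i = len(script) if newline == -1 else newline
            continue
        elif ch == ";":
            statement = "".join(current).strip()
            if statement:
                statements.append(statement)
            current = []
        else:
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements
```

Each returned statement can then be passed to `session.run(...)` individually.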

Method 4: Automated Import¶

from docling_graph import run_pipeline, PipelineConfig
import subprocess

# Extract and export
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    export_format="cypher",
    output_dir="neo4j_import"
)

run_pipeline(config)

# Import to Neo4j
result = subprocess.run([
    "cypher-shell",
    "-u", "neo4j",
    "-p", "password",
    "-f", "neo4j_import/graph.cypher"
], capture_output=True, text=True)

if result.returncode == 0:
    print("✅ Successfully imported to Neo4j")
else:
    print(f"❌ Import failed: {result.stderr}")
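
Passing `-p password` on the command line exposes the password in the process list. `cypher-shell` documents the `NEO4J_USERNAME` and `NEO4J_PASSWORD` environment variables as alternatives to `-u` and `-p`; verify this against your installed version. A sketch of the automated import using them follows. The helper names `build_import_command` and `run_import` are made up for this example.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional, Tuple


def build_import_command(
    script_path: str,
    user: str,
    password: str,
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for cypher-shell with credentials passed via env."""
    env = dict(os.environ if base_env is None else base_env)
    env["NEO4J_USERNAME"] = user
    env["NEO4J_PASSWORD"] = password
    return ["cypher-shell", "-f", script_path], env


def run_import(script_path: str, user: str, password: str) -> None:
    """Run the import; raises CalledProcessError on a non-zero exit code."""
    argv, env = build_import_command(script_path, user, password)
    subprocess.run(argv, env=env, check=True, capture_output=True, text=True)
```

Calling `run_import("neo4j_import/graph.cypher", "neo4j", password)` raises `subprocess.CalledProcessError` on failure, with the error text available on the exception's `stderr` attribute.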

Querying Neo4j¶

Basic Queries¶

Count Nodes¶

// Count all nodes
MATCH (n)
RETURN count(n) as total_nodes

// Count by type
MATCH (n)
RETURN labels(n) as type, count(n) as count
ORDER BY count DESC

Count Relationships¶

// Count all relationships
MATCH ()-[r]->()
RETURN count(r) as total_relationships

// Count by type
MATCH ()-[r]->()
RETURN type(r) as relationship_type, count(r) as count
ORDER BY count DESC

Finding Nodes¶

Find Specific Node¶

// Find invoice by number
MATCH (i:BillingDocument {document_no: "INV-001"})
RETURN i

// Find organization by name
MATCH (o:Organization {name: "Acme Corp"})
RETURN o
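
The lookups above hard-code the value being matched. When the same query is issued from Python, passing the value as a query parameter (`$document_no`) avoids string concatenation and lets Neo4j reuse the query plan. A minimal sketch, where `find_invoice` is a hypothetical helper and `session` is assumed to be a driver session (or anything with a compatible `run()` method):

```python
from typing import Any, List, Optional

FIND_INVOICE = """
MATCH (i:BillingDocument {document_no: $document_no})
RETURN i.document_no AS document_no, i.total AS total
"""


def find_invoice(session: Any, document_no: str) -> Optional[dict]:
    """Look up one invoice by number; returns None when nothing matches."""
    # Keyword arguments to run() are sent as query parameters.
    result = session.run(FIND_INVOICE, document_no=document_no)
    rows: List[dict] = [dict(record) for record in result]
    return rows[0] if rows else None
```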

Find All of Type¶

// Find all invoices
MATCH (i:BillingDocument)
RETURN i
LIMIT 10

// Find all organizations
MATCH (o:Organization)
RETURN o.name, o.address

Relationship Queries¶

Direct Relationships¶

// Find who issued an invoice
MATCH (i:BillingDocument {document_no: "INV-001"})-[:ISSUED_BY]->(o:Organization)
RETURN i.document_no, o.name

// Find all line items in an invoice
MATCH (i:BillingDocument)-[:CONTAINS_LINE]->(item:LineItem)
WHERE i.document_no = "INV-001"
RETURN item.description, item.total

Multi-Hop Relationships¶

// Find invoice -> organization -> address
MATCH (i:BillingDocument)-[:ISSUED_BY]->(o:Organization)-[:LOCATED_AT]->(a:Address)
RETURN i.document_no, o.name, a.city

// Find all paths of up to 3 hops between an invoice and any address
// (variables are named doc/addr because END is a Cypher keyword)
MATCH path = (doc:BillingDocument)-[*..3]-(addr:Address)
WHERE doc.document_no = "INV-001"
RETURN path

Aggregation Queries¶

Sum and Average¶

// Total invoice amount
MATCH (i:BillingDocument)
RETURN sum(i.total) as total_amount

// Average invoice amount
MATCH (i:BillingDocument)
RETURN avg(i.total) as average_amount

// Count invoices per organization
MATCH (o:Organization)<-[:ISSUED_BY]-(i:BillingDocument)
RETURN o.name, count(i) as invoice_count
ORDER BY invoice_count DESC
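
In the last query, `o.name` is not aggregated, so Cypher uses it as the grouping key for `count(i)`; there is no separate GROUP BY clause. The same logic in plain Python, over hypothetical (organization, invoice number) pairs, looks roughly like this:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def invoices_per_organization(
    pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, int]]:
    """Count invoices per organization name, highest count first."""
    counts = Counter(org for org, _invoice in pairs)
    return counts.most_common()
```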

Pattern Matching¶

Complex Patterns¶

// Find invoices with specific pattern
MATCH (i:BillingDocument)-[:ISSUED_BY]->(o:Organization),
      (i)-[:SENT_TO]->(c:Organization),
      (i)-[:CONTAINS_LINE]->(item:LineItem)
WHERE i.total > 1000
RETURN i, o, c, collect(item) as items

// Find organizations that both issue and receive invoices
MATCH (o:Organization)<-[:ISSUED_BY]-(i1:BillingDocument),
      (o)<-[:SENT_TO]-(i2:BillingDocument)
RETURN o.name, count(DISTINCT i1) as issued, count(DISTINCT i2) as received

Complete Examples¶

📍 Import and Query¶

from docling_graph import run_pipeline, PipelineConfig
from neo4j import GraphDatabase

# 1. Extract and export
config = PipelineConfig(
    source="invoices.pdf",
    template="templates.BillingDocument",
    export_format="cypher",
    output_dir="neo4j_data"
)

run_pipeline(config)

# 2. Import to Neo4j
driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    auth=("neo4j", "password")
)

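with open("neo4j_data/graph.cypher", encoding="utf-8") as f:
    # The driver runs one statement per run() call.
    # Naive split on ";" (assumes no ";" inside string literals).
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

with driver.session() as session:
    for statement in statements:
        session.run(statement)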

# 3. Query
with driver.session() as session:
    result = session.run("""
        MATCH (i:BillingDocument)
        RETURN i.document_no, i.total
        ORDER BY i.total DESC
        LIMIT 5
    """)

    for record in result:
        print(f"{record['i.document_no']}: ${record['i.total']}")

driver.close()

📍 Batch Import¶

from docling_graph import run_pipeline, PipelineConfig
from pathlib import Path
import subprocess

# Process multiple documents
for pdf_file in Path("documents").glob("*.pdf"):
    print(f"Processing {pdf_file.name}")

    # Extract
    config = PipelineConfig(
        source=str(pdf_file),
        template="templates.BillingDocument",
        export_format="cypher",
        output_dir=f"neo4j_batch/{pdf_file.stem}"
    )

    run_pipeline(config)

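    # Import, and report failures instead of silently ignoring them
    cypher_file = f"neo4j_batch/{pdf_file.stem}/graph.cypher"
    result = subprocess.run([
        "cypher-shell",
        "-u", "neo4j",
        "-p", "password",
        "-f", cypher_file
    ], capture_output=True, text=True)
    if result.returncode != 0:
        print(f"❌ Import failed for {pdf_file.name}: {result.stderr}")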

print("✅ Batch import complete")

📍 Query and Export¶

from neo4j import GraphDatabase
import pandas as pd

# Connect
driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    auth=("neo4j", "password")
)

# Query
with driver.session() as session:
    result = session.run("""
        MATCH (i:BillingDocument)-[:ISSUED_BY]->(o:Organization)
        RETURN i.document_no as invoice,
               o.name as organization,
               i.total as amount
        ORDER BY i.total DESC
    """)

    # Convert to DataFrame
    df = pd.DataFrame([dict(record) for record in result])

    # Export
    df.to_csv("invoice_summary.csv", index=False)
    print(f"Exported {len(df)} records")

driver.close()
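
If pandas is not available, the standard library's `csv` module can write the same summary. A sketch, assuming each record can be converted with `dict(record)` as in the example above; the helper name `write_records_csv` is made up for this example.

```python
import csv
from typing import Iterable, Mapping


def write_records_csv(records: Iterable[Mapping[str, object]], path: str) -> int:
    """Write dict-like records to a CSV file and return the number of rows."""
    rows = [dict(record) for record in records]
    if not rows:
        return 0  # nothing to write; no file is created
    with open(path, "w", newline="", encoding="utf-8") as f:
        # Column order follows the keys of the first record.
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```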

Best Practices¶

👍 Clear Database Before Import¶

// Delete all nodes and relationships (irreversible; intended for development databases)
MATCH (n)
DETACH DELETE n

// Verify empty
MATCH (n)
RETURN count(n)

👍 Create Indexes¶

// Create index on invoice number
CREATE INDEX document_no_idx FOR (i:BillingDocument) ON (i.document_no)

// Create index on organization name
CREATE INDEX org_name_idx FOR (o:Organization) ON (o.name)

// List indexes
SHOW INDEXES

👍 Use Constraints¶

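// Unique constraint on invoice number
// (this also creates a backing index, so a separate index on document_no is not needed)
CREATE CONSTRAINT invoice_unique FOR (i:BillingDocument) REQUIRE i.document_no IS UNIQUE

// Existence constraint (requires Neo4j Enterprise Edition)
CREATE CONSTRAINT document_no_exists FOR (i:BillingDocument) REQUIRE i.document_no IS NOT NULL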
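
Running a `CREATE INDEX` or `CREATE CONSTRAINT` statement a second time fails once the index or constraint already exists. Recent Neo4j versions accept `IF NOT EXISTS`, which makes the statements safe to re-run before every import. The sketch below generates such statements; the helper names are made up for this example, and the identifier check is there because labels and property names cannot be passed as query parameters.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(*names: str) -> None:
    """Reject anything that is not a plain identifier."""
    for name in names:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")


def index_statement(name: str, label: str, prop: str) -> str:
    """Build an idempotent CREATE INDEX statement."""
    _check(name, label, prop)
    return f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"


def unique_constraint_statement(name: str, label: str, prop: str) -> str:
    """Build an idempotent uniqueness constraint statement."""
    _check(name, label, prop)
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )
```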

👍 Batch Imports¶

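# ✅ Good - Import in batches, one transaction per batch
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Load statements (naive split on ";"; assumes no ";" inside string literals)
with open("neo4j_import/graph.cypher", encoding="utf-8") as f:
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

# Process in batches
# NOTE: assumes data statements only; schema changes such as CREATE INDEX
# cannot share a transaction with data writes.
batch_size = 1000
with driver.session() as session:
    for i in range(0, len(statements), batch_size):
        batch = statements[i:i+batch_size]

        # Commit each batch as a single transaction
        with session.begin_transaction() as tx:
            for statement in batch:
                tx.run(statement)
            tx.commit()

        print(f"Imported batch {i//batch_size + 1}")

driver.close()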

Troubleshooting¶

🐛 Connection Refused¶

Solution:

# Check Neo4j is running
docker ps | grep neo4j

# Or check service
systemctl status neo4j

# Restart if needed
docker restart neo4j

🐛 Authentication Failed¶

Solution:

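# Log in with the default credentials
cypher-shell -u neo4j -p neo4j
# Then change password when prompted

# Or set in Docker (NEO4J_AUTH only takes effect when the container starts with an empty data directory)
docker run -e NEO4J_AUTH=neo4j/newpassword neo4j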

🐛 Import Fails¶

Solution:

# Check Cypher syntax
head -20 neo4j_import/graph.cypher

# Test small portion
head -100 neo4j_import/graph.cypher | cypher-shell -u neo4j -p password

# Check logs
docker logs neo4j
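
Note that `head -100` can cut a multi-line statement in half, which produces a syntax error of its own. An alternative is to run the statements one at a time from Python and stop at the first one that fails. A sketch, where `first_failing_statement` is a hypothetical helper and `run` is any callable that executes a single statement:

```python
from typing import Callable, Iterable, Optional, Tuple


def first_failing_statement(
    statements: Iterable[str],
    run: Callable[[str], object],
) -> Optional[Tuple[int, str, Exception]]:
    """Run statements in order and report the first failure.

    Returns (1-based position, statement, error), or None if all succeed.
    """
    for position, statement in enumerate(statements, start=1):
        try:
            run(statement)
        except Exception as exc:  # report whatever the caller's runner raised
            return position, statement, exc
    return None
```

With the Python driver, `run` could be `lambda s: session.run(s).consume()`; calling `consume()` forces any server-side error to surface immediately.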

🐛 Slow Queries¶

Solution:

// Create an index on the property used in the filter
CREATE INDEX billing_total_idx FOR (i:BillingDocument) ON (i.total)

// Use EXPLAIN to analyze
EXPLAIN MATCH (i:BillingDocument) WHERE i.total > 1000 RETURN i

// Use PROFILE for detailed analysis
PROFILE MATCH (i:BillingDocument) WHERE i.total > 1000 RETURN i

Advanced Topics¶

Graph Algorithms¶

// Find shortest path
MATCH path = shortestPath(
    (doc:BillingDocument {document_no: "INV-001"})-[*]-(addr:Address)
)
RETURN path

// PageRank (requires the Graph Data Science library and a projected graph named 'myGraph')
CALL gds.pageRank.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC

Full-Text Search¶

// Create full-text index
CREATE FULLTEXT INDEX organization_search FOR (o:Organization) ON EACH [o.name, o.description]

// Search
CALL db.index.fulltext.queryNodes('organization_search', 'Acme')
YIELD node, score
RETURN node.name, score
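
Full-text indexes interpret the search string using Lucene query syntax, so characters such as `(`, `:`, or `+` in user input are treated as operators. When the search term comes from users, escaping those characters first avoids parse errors. A sketch based on the Lucene list of special characters; the helper name `escape_lucene` is made up for this example.

```python
import re

# Lucene query syntax special characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
_LUCENE_SPECIAL = re.compile(r'(&&|\|\||[+\-!(){}\[\]^"~*?:\\/])')


def escape_lucene(text: str) -> str:
    """Prefix every Lucene special character (or operator pair) with a backslash."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)
```

The escaped string can then be passed as a parameter, for example `session.run("CALL db.index.fulltext.queryNodes('organization_search', $q) YIELD node, score RETURN node.name, score", q=escape_lucene(term))`.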

Next Steps¶

Now that you understand Neo4j integration:

- Graph Analysis: analyze graph structure
- CLI Guide: use command-line tools
- API Reference: programmatic access