# Python API

## Overview

The docling-graph Python API provides programmatic access to the document-to-graph pipeline, enabling integration into Python applications, notebooks, and workflows.
Key Components:

- `run_pipeline()` - Main pipeline function
- `PipelineConfig` - Type-safe configuration
- Direct module imports for advanced usage
## Quick Start

### Basic Usage (API Mode - No File Exports)

```python
from docling_graph import run_pipeline, PipelineConfig

# Configure pipeline
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
)

# Run pipeline - returns data directly
context = run_pipeline(config)

# Access results in memory
graph = context.knowledge_graph
model = context.pydantic_model
print(f"Extracted {graph.number_of_nodes()} nodes")
```
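The `knowledge_graph` returned above is a standard `networkx` graph, so the usual `networkx` API applies to it. A minimal, self-contained sketch of walking nodes and edges - the tiny graph built here and its `type`/`label` attributes are stand-ins for illustration, not the pipeline's actual schema:

```python
import networkx as nx

# A tiny stand-in for context.knowledge_graph; the attribute names
# ("type", "label") are assumptions for illustration only.
graph = nx.DiGraph()
graph.add_node("invoice-1", type="Invoice")
graph.add_node("acme", type="Company")
graph.add_edge("invoice-1", "acme", label="ISSUED_BY")

# Iterate nodes and edges exactly as you would on the pipeline's output
for node_id, attrs in graph.nodes(data=True):
    print(node_id, attrs.get("type"))

for src, dst, attrs in graph.edges(data=True):
    print(f"{src} -[{attrs.get('label')}]-> {dst}")
```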
## Installation

```shell
# Install with all features
uv sync

# Or specific features
uv sync  # Remote APIs
uv sync  # Local inference
```
## API Components

### 1. PipelineConfig

Type-safe configuration class with validation.

```python
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
)
```

Learn more: PipelineConfig →

### 2. run_pipeline()

Main pipeline execution function.

```python
from docling_graph import run_pipeline

run_pipeline({
    "source": "document.pdf",
    "template": "templates.BillingDocument",
})
```

Learn more: run_pipeline() →

### 3. Direct Module Access

For advanced usage, import modules directly.

```python
from docling_graph.core.converters import GraphConverter
from docling_graph.core.exporters import CSVExporter
from docling_graph.core.visualizers import InteractiveVisualizer
```

Learn more: API Reference →
## Common Patterns

### Pattern 1: Simple Conversion (Memory-Efficient)

```python
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="invoice.pdf",
    template="templates.BillingDocument",
)

# Returns data directly - no file exports
context = run_pipeline(config)
graph = context.knowledge_graph
invoice = context.pydantic_model
```
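Because `context.pydantic_model` is a Pydantic model instance, the standard Pydantic v2 serialization methods apply downstream. A sketch with a hypothetical stand-in model - the class and its field names are illustrative, not the real `BillingDocument` schema:

```python
from pydantic import BaseModel

# Hypothetical stand-in for the template-backed model the pipeline
# returns as context.pydantic_model; field names are illustrative.
class BillingDocument(BaseModel):
    invoice_number: str
    total: float

invoice = BillingDocument(invoice_number="INV-001", total=99.5)

# Serialize the extracted data for downstream use
data = invoice.model_dump()          # plain dict
payload = invoice.model_dump_json()  # JSON string for storage or APIs
print(data["total"])
```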
### Pattern 2: Custom Configuration

```python
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="research.pdf",
    template="templates.ScholarlyRheologyPaper",
    backend="llm",
    inference="remote",
    provider_override="mistral",
    model_override="mistral-large-latest",
    processing_mode="many-to-one",
    use_chunking=True,
    llm_consolidation=True,
)

# Access results in memory
context = run_pipeline(config)
graph = context.knowledge_graph
print(f"Research: {graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")
```
### Pattern 3: Batch Processing (Memory-Efficient)

```python
from pathlib import Path
from docling_graph import run_pipeline, PipelineConfig

documents = Path("documents").glob("*.pdf")
all_graphs = []

for doc in documents:
    config = PipelineConfig(
        source=str(doc),
        template="templates.BillingDocument",
    )
    try:
        # Process without file exports
        context = run_pipeline(config)
        all_graphs.append({
            "filename": doc.name,
            "graph": context.knowledge_graph,
            "model": context.pydantic_model,
        })
        print(f"✅ Processed: {doc.name}")
    except Exception as e:
        print(f"❌ Failed: {doc.name} - {e}")

# Aggregate results
total_nodes = sum(g["graph"].number_of_nodes() for g in all_graphs)
print(f"\nTotal entities: {total_nodes}")
```
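After the batch loop, the aggregated results can be reduced to a JSON summary with the standard library. A sketch using plain dicts as stand-ins for the per-document entries (the real entries also hold graph and model objects, which are not JSON-serializable as-is):

```python
import json

# Stand-in batch results; in Pattern 3 each entry also carries the
# networkx graph and the pydantic model, which would need converting
# before JSON export.
all_graphs = [
    {"filename": "a.pdf", "nodes": 12},
    {"filename": "b.pdf", "nodes": 7},
]

summary = {
    "documents": len(all_graphs),
    "total_nodes": sum(g["nodes"] for g in all_graphs),
    "per_document": {g["filename"]: g["nodes"] for g in all_graphs},
}

report = json.dumps(summary, indent=2)
print(report)
```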
### Pattern 4: Error Handling

```python
from docling_graph import run_pipeline, PipelineConfig
from docling_graph.exceptions import (
    ConfigurationError,
    ExtractionError,
    PipelineError,
)

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
)

try:
    run_pipeline(config)
except ConfigurationError as e:
    print(f"Configuration error: {e.message}")
    print(f"Details: {e.details}")
except ExtractionError as e:
    print(f"Extraction failed: {e.message}")
except PipelineError as e:
    print(f"Pipeline error: {e.message}")
```
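For transient failures such as remote API hiccups, the `except ExtractionError` branch can delegate to a retry wrapper. A generic retry-with-backoff sketch - `with_retries` and the flaky stand-in function are hypothetical helpers, not part of docling-graph, and the stand-in replaces `run_pipeline` so the example is self-contained:

```python
import time

# Generic retry wrapper with exponential backoff (hypothetical helper).
def with_retries(fn, attempts=3, base_delay=0.01):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}

def flaky_pipeline():
    # Stand-in for run_pipeline(config): fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "context"

result = with_retries(flaky_pipeline)
print(result, calls["n"])
```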
## Comparison: CLI vs Python API
| Feature | CLI | Python API |
|---|---|---|
| Ease of Use | Simple commands | Requires Python code |
| Flexibility | Limited to options | Full programmatic control |
| Integration | Shell scripts | Python applications |
| File Exports | Always exports files | No exports by default (memory-efficient) |
| Return Values | N/A | Returns PipelineContext with graph and model |
| Batch Processing | Shell loops | Python loops with error handling |
| Configuration | YAML + flags | PipelineConfig objects |
| Best For | Quick tasks, scripts | Applications, notebooks, workflows |
> **Python API export behavior:** The Python API defaults to `dump_to_disk=False` for memory efficiency. Set `dump_to_disk=True` to enable file exports.
## Environment Setup

### API Keys

```python
import os

# Set API keys programmatically
os.environ["MISTRAL_API_KEY"] = "your-key"
os.environ["OPENAI_API_KEY"] = "your-key"

# Or use python-dotenv
from dotenv import load_dotenv

load_dotenv()

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="remote",
)

run_pipeline(config)
```
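When keys come from the environment, it helps to fail fast with a clear message before starting the pipeline rather than deep inside a provider call. A generic sketch - this `require_env` helper is hypothetical, not a docling-graph API:

```python
import os

# Fail fast if a required key is missing (hypothetical helper,
# not part of docling-graph).
def require_env(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set")
    return value

os.environ["MISTRAL_API_KEY"] = "your-key"  # normally set in your shell or .env
key = require_env("MISTRAL_API_KEY")
print(key)
```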
### Python Path

```python
import sys
from pathlib import Path

# Add project root to path
project_root = Path(__file__).parent.parent
sys.path.append(str(project_root))

# Now you can import templates
from templates.billing_document import BillingDocument
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template=BillingDocument,  # Pass the class directly
)

run_pipeline(config)
```
## Integration Examples

### Flask Web Application

```python
from flask import Flask, request, jsonify
from docling_graph import run_pipeline, PipelineConfig
from pathlib import Path
import uuid

app = Flask(__name__)

@app.route('/convert', methods=['POST'])
def convert_document():
    # Get uploaded file
    file = request.files['document']
    template = request.form.get('template', 'templates.BillingDocument')

    # Save temporarily
    temp_id = str(uuid.uuid4())
    temp_path = f"temp/{temp_id}_{file.filename}"
    file.save(temp_path)

    # Process
    try:
        config = PipelineConfig(
            source=temp_path,
            template=template,
        )
        context = run_pipeline(config)
        return jsonify({
            "status": "success",
            "model": context.pydantic_model.model_dump()
        })
    except Exception as e:
        return jsonify({
            "status": "error",
            "message": str(e)
        }), 500
    finally:
        # Cleanup
        Path(temp_path).unlink(missing_ok=True)

if __name__ == '__main__':
    app.run(debug=True)
```
### Jupyter Notebook

```python
# Cell 1: Setup
from docling_graph import run_pipeline, PipelineConfig
import pandas as pd
import matplotlib.pyplot as plt

# Cell 2: Process document
config = PipelineConfig(
    source="research.pdf",
    template="templates.ScholarlyRheologyPaper"
)
context = run_pipeline(config)

# Cell 3: Analyze results
graph = context.knowledge_graph
print(f"Total nodes: {graph.number_of_nodes()}")
print(f"Total edges: {graph.number_of_edges()}")

# Cell 4: Visualize
# Build a DataFrame of node attributes (assumes nodes carry a 'type' attribute)
nodes = pd.DataFrame(attrs for _, attrs in graph.nodes(data=True))
node_types = nodes['type'].value_counts()
node_types.plot(kind='bar', title='Node Types')
plt.show()
```
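If pandas is not available in the notebook, the same node-type tally works with the standard library. A sketch using stand-in attribute dicts - in the notebook these come from `graph.nodes(data=True)`, and the `type` attribute is an assumption about the extracted schema:

```python
from collections import Counter

# Stand-in node attribute dicts; in the notebook these come from
# graph.nodes(data=True).
node_attrs = [
    {"type": "Invoice"},
    {"type": "Company"},
    {"type": "Company"},
]

# Tally node types without pandas
node_types = Counter(attrs.get("type") for attrs in node_attrs)
print(node_types.most_common())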
### Airflow DAG

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
from docling_graph import run_pipeline, PipelineConfig

def process_document(**context):
    config = PipelineConfig(
        source=context['params']['source'],
        template=context['params']['template']
    )
    run_pipeline(config)

with DAG(
    'document_processing',
    start_date=datetime(2024, 1, 1),
    schedule_interval='@daily'
) as dag:
    process_task = PythonOperator(
        task_id='process_document',
        python_callable=process_document,
        params={
            'source': 'documents/daily.pdf',
            'template': 'templates.BillingDocument'
        }
    )
```
## Best Practices

### 👍 Use Type-Safe Configuration

```python
# ✅ Good - Type-safe with validation
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm"  # Validated
)

# ❌ Avoid - Dictionary without validation
config = {
    "source": "document.pdf",
    "template": "templates.BillingDocument",
    "backend": "invalid"  # No validation
}
```
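The validate-on-construction idea behind `PipelineConfig` can be sketched with a stdlib dataclass. Everything here - `MiniConfig`, the allowed-backend set - is illustrative, not the real class or its real option list:

```python
from dataclasses import dataclass

# Illustrative only: PipelineConfig's actual fields and allowed
# values live in docling-graph.
VALID_BACKENDS = {"llm"}

@dataclass
class MiniConfig:
    source: str
    backend: str = "llm"

    def __post_init__(self):
        # Reject bad values at construction time, not deep in the pipeline
        if self.backend not in VALID_BACKENDS:
            raise ValueError(f"invalid backend: {self.backend!r}")

MiniConfig(source="document.pdf")  # ok

try:
    MiniConfig(source="document.pdf", backend="invalid")
except ValueError as e:
    print(e)
```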
### 👍 Handle Errors Gracefully

```python
# ✅ Good - Specific error handling
import logging

from docling_graph import run_pipeline, PipelineConfig
from docling_graph.exceptions import ExtractionError

logger = logging.getLogger(__name__)

try:
    run_pipeline(config)
except ExtractionError as e:
    logger.error(f"Extraction failed: {e.message}")
    # Implement retry logic or fallback

# ❌ Avoid - Catching all exceptions
try:
    run_pipeline(config)
except Exception:
    pass  # Silent failure
```
## Next Steps

Explore the Python API in detail:

- run_pipeline() → - Pipeline function
- PipelineConfig → - Configuration class
- Programmatic Examples → - Code examples
- Batch Processing → - Batch patterns

Or continue to:

- Examples → - Real-world examples
- API Reference → - Complete API docs