URL Input Example¶
Overview¶
This example demonstrates how to process documents directly from URLs, showcasing Docling Graph's ability to download and extract data from remote documents without manual file management.
Time: 10 minutes
Use Case: Rheology Research Analysis¶
Extract structured information from a scientific paper hosted on arXiv, including authors, abstract, methodology, and key findings.
Document Source¶
URL: https://arxiv.org/pdf/2207.02720
Type: PDF (scientific paper on rheology)
Content: Scientific paper with complex structure including authors, abstract, methodology, results, and references.
Template Definition¶
We'll use a rheology research template that captures the essential structure of scientific documents.
from pydantic import BaseModel, Field
from docling_graph.utils import edge


class Author(BaseModel):
    """Author entity."""

    model_config = {
        'is_entity': True,
        'graph_id_fields': ['name']
    }

    name: str = Field(description="Author's full name")
    affiliation: str | None = Field(
        default=None,
        description="Author's institutional affiliation"
    )


class Methodology(BaseModel):
    """Research methodology component."""

    model_config = {'is_entity': False}

    approach: str = Field(description="Research approach or method used")
    materials: list[str] = Field(
        default_factory=list,
        description="Materials or tools used"
    )
    procedure: str = Field(description="Experimental or analytical procedure")


class Finding(BaseModel):
    """Key research finding."""

    model_config = {'is_entity': False}

    description: str = Field(description="Description of the finding")
    significance: str = Field(description="Significance or implication")


class Research(BaseModel):
    """Complete rheology research structure."""

    model_config = {'is_entity': True}

    title: str = Field(description="Paper title")
    abstract: str = Field(description="Paper abstract")
    authors: list[Author] = edge(
        "AUTHORED_BY",
        description="Paper authors"
    )
    methodology: Methodology = Field(description="Research methodology")
    key_findings: list[Finding] = Field(
        default_factory=list,
        description="Key research findings"
    )
    conclusion: str = Field(description="Paper conclusion")
Save as: templates/research.py
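Before running the pipeline, it can help to sanity-check the template by instantiating it with sample data. A minimal sketch (the values are illustrative, and it assumes fields declared with edge() accept values like ordinary pydantic fields):

# Sanity check: instantiate the template with illustrative sample data.
# Assumes edge() fields accept values like ordinary pydantic fields.
sample = Research(
    title="Sample Rheology Paper",
    abstract="A short abstract.",
    authors=[Author(name="Jane Smith", affiliation="Institute Y")],
    methodology=Methodology(
        approach="Experimental rheology",
        materials=["Polymer samples", "Rheometer"],
        procedure="Shear-rate sweep at constant temperature."
    ),
    key_findings=[
        Finding(
            description="Viscosity decreases with shear rate.",
            significance="Indicates shear-thinning behavior."
        )
    ],
    conclusion="A short conclusion."
)
print(sample.model_dump_json(indent=2))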
Processing with CLI¶
Basic URL Processing¶
# Process rheology research from URL
uv run docling-graph convert "https://arxiv.org/pdf/2207.02720" \
    --template "templates.research.Research" \
    --processing-mode "many-to-one" \
    --backend llm \
    --inference remote
With Custom Output¶
# Process with custom output directory
uv run docling-graph convert "https://arxiv.org/pdf/2207.02720" \
    --template "templates.research.Research" \
    --processing-mode "many-to-one" \
    --output-dir "outputs/research_paper" \
    --export-format json
With Specific Model¶
# Use a specific LLM model
uv run docling-graph convert "https://arxiv.org/pdf/2207.02720" \
    --template "templates.research.Research" \
    --processing-mode "many-to-one" \
    --backend llm \
    --inference remote \
    --provider openai \
    --model gpt-4-turbo
Processing with Python API¶
Basic Usage¶
from docling_graph import run_pipeline, PipelineConfig
from templates.research import Research

# Configure pipeline for URL input
config = PipelineConfig(
    source="https://arxiv.org/pdf/2207.02720",
    template=Research,
    backend="llm",
    inference="remote",
    processing_mode="many-to-one"
)

# Run pipeline
run_pipeline(config)
With Custom Settings¶
from docling_graph import run_pipeline, PipelineConfig
from templates.research import Research

# Advanced configuration
config = PipelineConfig(
    source="https://arxiv.org/pdf/2207.02720",
    template=Research,
    backend="llm",
    inference="remote",
    processing_mode="many-to-one",
    provider_override="mistral",
    model_override="mistral-large-latest",
    use_chunking=True,
    llm_consolidation=True,
    export_format="json"
)

# Run pipeline
run_pipeline(config)
Error Handling¶
from docling_graph import run_pipeline, PipelineConfig
from docling_graph.exceptions import ValidationError, ExtractionError
from templates.research import Research

try:
    config = PipelineConfig(
        source="https://arxiv.org/pdf/2207.02720",
        template=Research,
        backend="llm",
        inference="remote",
        processing_mode="many-to-one"
    )
    run_pipeline(config)
except ValidationError as e:
    print(f"URL validation failed: {e.message}")
    if e.details:
        print(f"Details: {e.details}")
except ExtractionError as e:
    print(f"Extraction failed: {e.message}")
    # Handle extraction errors (e.g., retry with a different model)
Expected Output¶
Graph Structure¶
Research (root node)
├── AUTHORED_BY → Author (John Doe)
├── AUTHORED_BY → Author (Jane Smith)
├── methodology (embedded)
│ ├── approach: "Experimental rheology"
│ ├── materials: ["Polymer samples", "Rheometer"]
│ └── procedure: "..."
├── key_findings (list)
│ ├── Finding 1: "..."
│ └── Finding 2: "..."
└── conclusion: "..."
CSV Export¶
nodes.csv (shown with a separate header row per node type, since each type has different columns):

node_id,node_type,title,abstract,conclusion
research_1,Research,"Paper Title","Abstract text...","Conclusion text..."

node_id,node_type,name,affiliation
author_john_doe,Author,"John Doe","University X"
author_jane_smith,Author,"Jane Smith","Institute Y"
edges.csv:
source_id,target_id,edge_type
research_1,author_john_doe,AUTHORED_BY
research_1,author_jane_smith,AUTHORED_BY
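The flat CSV layout is easy to query directly. For example, counting authorship edges with the standard library (a sketch; the output path assumes the --output-dir used earlier):

import csv

# Count AUTHORED_BY edges in the exported edges.csv
# (the path is an assumption based on the --output-dir example above).
with open("outputs/research_paper/edges.csv", newline="") as f:
    edges = list(csv.DictReader(f))

authored = [e for e in edges if e["edge_type"] == "AUTHORED_BY"]
print(f"{len(authored)} AUTHORED_BY edges")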
JSON Export¶
{
  "nodes": [
    {
      "id": "research_1",
      "type": "Research",
      "properties": {
        "title": "Paper Title",
        "abstract": "Abstract text...",
        "conclusion": "Conclusion text..."
      }
    },
    {
      "id": "author_john_doe",
      "type": "Author",
      "properties": {
        "name": "John Doe",
        "affiliation": "University X"
      }
    }
  ],
  "edges": [
    {
      "source": "research_1",
      "target": "author_john_doe",
      "type": "AUTHORED_BY"
    }
  ]
}
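The JSON export maps directly onto standard graph tooling. A minimal sketch that loads it into networkx (the graph.json filename is an assumption; check your output directory for the actual name):

import json
import networkx as nx

# Load the exported graph (path and layout assumed from the example above).
with open("outputs/research_paper/graph.json") as f:
    data = json.load(f)

G = nx.DiGraph()
for node in data["nodes"]:
    G.add_node(node["id"], type=node["type"], **node["properties"])
for e in data["edges"]:
    G.add_edge(e["source"], e["target"], type=e["type"])

# List all authors connected to the root research node
authors = [target for _, target, d in G.out_edges("research_1", data=True)
           if d["type"] == "AUTHORED_BY"]
print(authors)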
URL Processing Features¶
Automatic Download¶
The pipeline automatically:

1. Downloads the PDF from the URL
2. Saves it to a temporary location
3. Detects the content type (PDF)
4. Routes it to the appropriate processing pipeline
5. Cleans up temporary files
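If you want to replicate this flow manually, for example to inspect the downloaded file, a rough equivalent with requests and tempfile looks like this (a sketch of the steps above, not the pipeline's actual implementation):

import tempfile
from pathlib import Path

import requests

url = "https://arxiv.org/pdf/2207.02720"

# 1. Download the PDF (streaming keeps memory use low for large files)
response = requests.get(url, timeout=60, stream=True)
response.raise_for_status()

# 2. Save it to a temporary location
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    for chunk in response.iter_content(chunk_size=8192):
        tmp.write(chunk)
    tmp_path = Path(tmp.name)

# 3./4. Inspect the content type and process the local file as usual
print(response.headers.get("Content-Type"))  # e.g. "application/pdf"

# 5. Clean up when done
tmp_path.unlink()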
Content Type Detection¶
Supported URL content types:

- PDF documents → full document pipeline
- Images (PNG, JPG) → full document pipeline
- Text files → text-only pipeline (LLM backend required)
- Markdown files → text-only pipeline (LLM backend required)
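To check what a URL serves before handing it to the pipeline, a HEAD request is usually enough (a sketch; the pipeline's own detection logic may differ):

import requests

def detect_content_type(url: str) -> str:
    """Return the Content-Type reported by the server (HEAD request)."""
    response = requests.head(url, timeout=10, allow_redirects=True)
    return response.headers.get("Content-Type", "unknown")

print(detect_content_type("https://arxiv.org/pdf/2207.02720"))
# e.g. "application/pdf" -> full document pipeline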
Configuration Options¶
from docling_graph.core.input.handlers import URLInputHandler

# Custom URL handler settings
handler = URLInputHandler(
    timeout=60,      # Download timeout in seconds
    max_size_mb=100  # Maximum file size in MB
)
Troubleshooting¶
🐛 URL Download Timeout¶
Error:
Solution:
# Increase timeout for large files
from docling_graph.core.input.handlers import URLInputHandler
handler = URLInputHandler(timeout=120) # 2 minutes
🐛 File Too Large¶
Error:
Solution:
# Increase size limit or download manually
handler = URLInputHandler(max_size_mb=200)

# Or download manually first
import requests

response = requests.get(url)
with open("document.pdf", "wb") as f:
    f.write(response.content)

# Then process the local file
config = PipelineConfig(source="document.pdf", ...)
🐛 Unsupported URL Scheme¶
Error:
Solution:
# Only HTTP/HTTPS URLs are supported
# For FTP or other protocols, download manually first
wget ftp://example.com/file.pdf
uv run docling-graph convert file.pdf --template "..."
Best Practices¶
👍 Use HTTPS When Available¶
# ✅ Good - Secure connection
source = "https://arxiv.org/pdf/2207.02720"
# ⚠️ Avoid - Insecure connection
source = "http://example.com/document.pdf"
👍 Handle Network Errors¶
from docling_graph.exceptions import ValidationError
try:
    run_pipeline(config)
except ValidationError as e:
    if "timeout" in str(e).lower():
        print("Network timeout - retrying with longer timeout")
        # Retry logic
    elif "failed to download" in str(e).lower():
        print("Download failed - check URL and network connection")
👍 Verify URL Before Processing¶
import requests

def verify_url(url: str) -> bool:
    """Verify the URL is accessible before processing."""
    try:
        # allow_redirects=True so redirecting hosts still return 200
        response = requests.head(url, timeout=10, allow_redirects=True)
        return response.status_code == 200
    except requests.RequestException:
        # Catch network errors specifically rather than using a bare except
        return False

if verify_url(url):
    config = PipelineConfig(source=url, ...)
    run_pipeline(config)
else:
    print(f"URL not accessible: {url}")
👍 Cache Downloaded Files¶
from pathlib import Path
import hashlib
def get_cache_path(url: str) -> Path:
"""Generate cache path for URL."""
url_hash = hashlib.md5(url.encode()).hexdigest()
return Path(f"cache/{url_hash}.pdf")
cache_path = get_cache_path(url)
if cache_path.exists():
# Use cached file
config = PipelineConfig(source=str(cache_path), ...)
else:
# Download from URL
config = PipelineConfig(source=url, ...)
Next Steps¶
- Markdown Input - Process markdown documents
- DoclingDocument Input - Use pre-processed documents
- Input Formats Guide - Complete input format reference