Rheology Research Extraction¶
Overview¶
Extract complex research data from scientific papers, including experiments, measurements, materials, and results.
Document Type: Rheology Research (PDF)
Time: 30 minutes
Backend: LLM with chunking
Prerequisites¶
Template Overview¶
The rheology research template (rheology_research.py) includes:
- Measurements - Flexible value/unit pairs
- Materials - Granular material properties
- Geometry - Experimental setup
- Vibration - Vibration parameters
- Simulation - DEM simulation details
- Results - Rheological measurements
- Experiments - Complete experiment instances
- Research - Root document model
Key Components¶
# 1. Measurement Model
class Measurement(BaseModel):
    """Flexible measurement with value and unit."""
    name: str
    numeric_value: float | None = None
    text_value: str | None = None
    unit: str | None = None

# 2. Enum Types
class GeometryType(str, Enum):
    VANE_RHEOMETER = "Vane Rheometer"
    DOUBLE_PLATE = "Double Plate"
    CYLINDRICAL_CONTAINER = "Cylindrical Container"

# 3. Experiment Entity
class Experiment(BaseModel):
    experiment_id: str
    objective: str
    granular_material: GranularMaterial = edge("USES_MATERIAL")
    vibration_conditions: VibrationConditions = edge("HAS_VIBRATION")
    rheological_results: List[RheologicalResult] = edge("HAS_RESULT")

# 4. Root Model
class Research(BaseModel):
    title: str
    authors: List[str]
    experiments: List[Experiment] = edge("HAS_EXPERIMENT")
Processing¶
Using CLI¶
# Process rheology research with chunking
uv run docling-graph convert research.pdf \
    --template "docs.examples.templates.rheology_research.ScholarlyRheologyPaper" \
    --backend llm \
    --inference remote \
    --provider mistral \
    --model mistral-large-latest \
    --processing-mode many-to-one \
    --use-chunking \
    --llm-consolidation \
    --docling-pipeline vision \
    --output-dir "outputs/research"
Using Python API¶
"""Process rheology research."""
import os

from docling_graph import run_pipeline, PipelineConfig

os.environ["MISTRAL_API_KEY"] = "your-key"

config = PipelineConfig(
    source="research.pdf",
    template="docs.examples.templates.rheology_research.ScholarlyRheologyPaper",
    backend="llm",
    inference="remote",
    provider_override="mistral",
    model_override="mistral-large-latest",
    processing_mode="many-to-one",
    use_chunking=True,
    llm_consolidation=True,
    docling_config="vision",  # Better for complex layouts
)

print("Processing rheology research (may take several minutes)...")
run_pipeline(config)
print("✅ Complete!")
Expected Results¶
Graph Structure¶
Research (Title)
├── HAS_EXPERIMENT → Experiment 1
│   ├── USES_MATERIAL → GranularMaterial
│   │   └── properties: [Measurement, Measurement]
│   ├── HAS_GEOMETRY → SystemGeometry
│   │   └── dimensions: [Measurement, Measurement]
│   ├── HAS_VIBRATION → VibrationConditions
│   │   ├── amplitude: Measurement
│   │   ├── frequency: Measurement
│   │   └── confining_pressure: Measurement
│   ├── HAS_SIMULATION → SimulationSetup
│   │   └── parameters: [Measurement, Measurement]
│   └── HAS_RESULT → RheologicalResult
│       └── measurement: Measurement
└── HAS_EXPERIMENT → Experiment 2
    └── ...
Statistics¶
{
  "node_count": 45,
  "edge_count": 38,
  "density": 0.019,
  "node_types": {
    "Research": 1,
    "Experiment": 3,
    "GranularMaterial": 3,
    "SystemGeometry": 3,
    "VibrationConditions": 3,
    "RheologicalResult": 12,
    "Measurement": 20
  }
}
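The figures above are internally consistent if `density` is read as the standard directed-graph density, edges divided by `n * (n - 1)`. That reading is an assumption: the page does not state which formula the tool uses. A minimal sketch that recomputes both totals from the example statistics:

```python
# Sanity check for the example statistics.
# Assumption: "density" is the directed-graph density E / (N * (N - 1)).
stats = {
    "node_count": 45,
    "edge_count": 38,
    "node_types": {
        "Research": 1,
        "Experiment": 3,
        "GranularMaterial": 3,
        "SystemGeometry": 3,
        "VibrationConditions": 3,
        "RheologicalResult": 12,
        "Measurement": 20,
    },
}


def directed_density(node_count: int, edge_count: int) -> float:
    """Return E / (N * (N - 1)), or 0.0 for graphs with fewer than two nodes."""
    if node_count < 2:
        return 0.0
    return edge_count / (node_count * (node_count - 1))


# The per-type counts should add up to the reported node count.
total_nodes = sum(stats["node_types"].values())
density = directed_density(stats["node_count"], stats["edge_count"])

print(total_nodes, round(density, 3))  # prints: 45 0.019
```

A check like this is a cheap way to notice a run that dropped a whole node type, since the per-type counts would no longer sum to `node_count`.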
Key Features¶
1. Enum Normalization¶
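class GeometryType(str, Enum):
    VANE_RHEOMETER = "Vane Rheometer"
    CYLINDRICAL_CONTAINER = "Cylindrical Container"

# The validator lives on the model that declares geometry_type
# (shown here on SystemGeometry, the geometry node in the graph)
class SystemGeometry(BaseModel):
    geometry_type: GeometryType

    # Validator accepts multiple formats
    @field_validator("geometry_type", mode="before")
    @classmethod
    def normalize_enum(cls, v):
        # Accepts: "Vane Rheometer", "vane_rheometer", "VANE_RHEOMETER"
        return _normalize_enum(GeometryType, v)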
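The body of `_normalize_enum` is not shown on this page. The sketch below is a hypothetical stand-in (named `normalize_enum_sketch` to make that clear) and assumes the helper matches input against both member names and member values after stripping case, spaces, underscores, and hyphens. The template's real helper may behave differently.

```python
from enum import Enum
from typing import Any, Type


class GeometryType(str, Enum):
    VANE_RHEOMETER = "Vane Rheometer"
    DOUBLE_PLATE = "Double Plate"
    CYLINDRICAL_CONTAINER = "Cylindrical Container"


def _canonical(text: str) -> str:
    """Lowercase the text and drop spaces, underscores, and hyphens."""
    return "".join(ch for ch in text.lower() if ch not in " _-")


def normalize_enum_sketch(enum_cls: Type[Enum], value: Any) -> Any:
    """Hypothetical stand-in for _normalize_enum, not the template's code."""
    # Leave existing members and non-string input untouched.
    if isinstance(value, enum_cls) or not isinstance(value, str):
        return value
    key = _canonical(value)
    for member in enum_cls:
        if key in (_canonical(member.name), _canonical(str(member.value))):
            return member
    # Unknown strings pass through so normal validation can reject them.
    return value


for raw in ("Vane Rheometer", "vane_rheometer", "VANE_RHEOMETER"):
    print(raw, "->", normalize_enum_sketch(GeometryType, raw).name)
```

Returning unmatched input unchanged keeps error reporting with the validator that runs afterwards, so a bad value is reported rather than silently replaced.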
2. Measurement Parsing¶
# Parses strings like "1.6 mPa.s", "2 mm", "80-90 °C"
def _parse_measurement_string(s: str):
    # Single value: "1.6 mPa.s" → {numeric_value: 1.6, unit: "mPa.s"}
    # Range: "80-90 °C" → {numeric_value_min: 80, numeric_value_max: 90, unit: "°C"}
    ...
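The parser body is elided above. As an illustration only, the following sketch shows one way to get the described behavior with two regular expressions. The function name `parse_measurement_sketch`, the patterns, and the `text_value` fallback for non-numeric input are assumptions, not the template's implementation.

```python
import re

# A signed integer or decimal number, e.g. "2", "1.6", "-5".
_NUM = r"[-+]?\d+(?:\.\d+)?"
# "<min>-<max> <unit>", accepting a hyphen or an en dash as the separator.
_RANGE = re.compile(rf"^\s*({_NUM})\s*[-–]\s*({_NUM})\s*(.*)$")
# "<value> <unit>"
_SINGLE = re.compile(rf"^\s*({_NUM})\s*(.*)$")


def parse_measurement_sketch(s: str) -> dict:
    """Hypothetical parser: returns range, single-value, or text fields."""
    match = _RANGE.match(s)
    if match:
        return {
            "numeric_value_min": float(match.group(1)),
            "numeric_value_max": float(match.group(2)),
            "unit": match.group(3).strip() or None,
        }
    match = _SINGLE.match(s)
    if match:
        return {
            "numeric_value": float(match.group(1)),
            "unit": match.group(2).strip() or None,
        }
    # No leading number: keep the raw text as a qualitative value.
    return {"text_value": s.strip()}


print(parse_measurement_sketch("1.6 mPa.s"))
print(parse_measurement_sketch("80-90 °C"))
```

The range pattern is tried first because the single-value pattern would otherwise accept `"80-90 °C"` and return `"-90 °C"` as the unit.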
3. Flexible Measurements¶
class Measurement(BaseModel):
    name: str
    numeric_value: float | None = None      # Single value
    numeric_value_min: float | None = None  # Range min
    numeric_value_max: float | None = None  # Range max
    text_value: str | None = None           # Qualitative
    unit: str | None = None
4. Nested Relationships¶
class Experiment(BaseModel):
    # Direct edges
    granular_material: GranularMaterial = edge("USES_MATERIAL")

    # Nested properties (not separate nodes)
    key_findings: List[str] = Field(default_factory=list)
Configuration Tips¶
For Long Documents¶
# Enable chunking and consolidation
uv run docling-graph convert research.pdf \
    --template "templates.ScholarlyRheologyPaper" \
    --use-chunking \
    --llm-consolidation \
    --processing-mode many-to-one
For Complex Layouts¶
# Use the vision pipeline for better table/figure handling
uv run docling-graph convert research.pdf \
    --template "templates.ScholarlyRheologyPaper" \
    --docling-pipeline vision
For Cost Optimization¶
# Use a smaller model without consolidation
uv run docling-graph convert research.pdf \
    --template "templates.ScholarlyRheologyPaper" \
    --model mistral-small-latest \
    --no-llm-consolidation
Customization¶
Simplify for Your Domain¶
"""Simplified research template."""
from pydantic import BaseModel, Field
from typing import List


def edge(label: str, **kwargs):
    """Declare a required field that becomes a labeled graph edge."""
    return Field(..., json_schema_extra={"edge_label": label}, **kwargs)


class Measurement(BaseModel):
    """Simple measurement."""
    name: str
    value: str  # Keep as string for simplicity
    unit: str | None = None


class Experiment(BaseModel):
    """Simplified experiment."""
    title: str
    objective: str
    methods: str
    results: str
    measurements: List[Measurement] = Field(default_factory=list)


class Research(BaseModel):
    """Simplified rheology research (for demonstration).

    Note: For production use, see the full ScholarlyRheologyPaper template at:
    docs/examples/templates/rheology_research.py

    The full template includes:
    - Comprehensive scholarly metadata (authors, affiliations, identifiers)
    - Detailed formulation specifications (materials, components, amounts)
    - Batch preparation history (mixing steps, equipment, conditions)
    - Complete rheometry setup (instruments, geometries, protocols)
    - Test runs and datasets (curves, measurements, model fits)
    """
    title: str
    authors: List[str]
    abstract: str
    experiments: List[Experiment] = edge("HAS_EXPERIMENT")
Troubleshooting¶
🐛 Extraction Takes Too Long¶
Solution:
# Disable consolidation for faster processing
uv run docling-graph convert research.pdf \
    --template "templates.ScholarlyRheologyPaper" \
    --no-llm-consolidation

# Or use a smaller model
uv run docling-graph convert research.pdf \
    --template "templates.ScholarlyRheologyPaper" \
    --model mistral-small-latest
🐛 Missing Measurements¶
Solution:
# Make measurements optional
measurements: List[Measurement] = Field(
    default_factory=list,
    description="List of measurements (optional)"
)
🐛 Enum Validation Errors¶
Solution:
# Add OTHER option to enums
class GeometryType(str, Enum):
    VANE_RHEOMETER = "Vane Rheometer"
    OTHER = "Other"  # Fallback

# Or make enum optional
geometry_type: GeometryType | None = Field(default=None)
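Adding an `OTHER` member does not by itself send unknown strings to it. One option is the standard library's `Enum._missing_` hook, sketched below. The sketch shows plain `Enum` behavior only; whether Pydantic's enum validation calls this hook depends on the Pydantic version, so treat that part as an assumption to verify in your own setup.

```python
from enum import Enum


class GeometryType(str, Enum):
    VANE_RHEOMETER = "Vane Rheometer"
    OTHER = "Other"  # Fallback

    @classmethod
    def _missing_(cls, value):
        # Called by Enum when no member has the requested value.
        return cls.OTHER


print(GeometryType("Vane Rheometer").name)
print(GeometryType("Cone and Plate").name)
```

The trade-off is that typos and unsupported geometries both collapse into `OTHER`, so it helps to keep the raw string in a separate text field if it matters downstream.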
Best Practices¶
👍 Start Simple, Add Complexity¶
# Phase 1: Basic structure
class Research(BaseModel):
    title: str
    authors: List[str]
    abstract: str

# Phase 2: Add experiments
class Research(BaseModel):
    title: str
    authors: List[str]
    abstract: str
    experiments: List[Experiment]

# Phase 3: Add measurements, validations, etc.
👍 Use Appropriate Chunking¶
# For papers longer than 10 pages
config = PipelineConfig(
    source="long_paper.pdf",
    template="templates.ScholarlyRheologyPaper",
    use_chunking=True,       # Essential
    llm_consolidation=True,  # Better accuracy
)
👍 Provide Clear Examples¶
# ✅ Good - Domain-specific examples
viscosity: Measurement = Field(
    description="Effective viscosity measurement",
    examples=[
        {"name": "Effective Viscosity", "numeric_value": 1.6, "unit": "mPa.s"}
    ]
)
Next Steps¶
- ID Card - Vision-based extraction
- Advanced Patterns - Complex templates
- Performance Tuning - Optimization