Best Practices and Checklist¶
Overview¶
This guide provides a comprehensive checklist and best practices for creating high-quality Pydantic templates. Use this as a final review before deploying your template.
In this guide: - Complete template checklist - Testing your template - Common pitfalls to avoid - Performance considerations - Maintenance tips
Complete Template Checklist¶
✅ Structure and Organization¶
- Module docstring - Clear description of template purpose
- Standard imports - All necessary imports included
- Edge helper function - Defined correctly with exact signature
- Logical organization - Components → Entities → Domain Models → Root
- Model docstrings - All models have clear docstrings
- Consistent naming - Follow Python naming conventions
✅ Entity Configuration¶
- graph_id_fields defined - All entities have appropriate ID fields
- ID fields are stable - Won't change frequently
- ID fields are likely present - Will be extracted from documents
- Composite IDs make sense - Multiple fields form natural unique identifier
✅ Component Configuration¶
- is_entity=False set - All components marked correctly
- Appropriate for sharing - Components represent value objects
- Content-based deduplication - All fields used for uniqueness
✅ Field Definitions¶
- Clear descriptions - LLM-friendly with extraction hints
- Realistic examples - 2-5 diverse examples per field
- Proper type hints - Optional, List, Union used correctly
- Appropriate defaults - Required (...), None, or meaningful defaults
- List fields use default_factory - Never use [] as default
✅ Edge Definitions¶
- Descriptive labels - ALL_CAPS_WITH_UNDERSCORES format
- Consistent naming - Same pattern across template
- List edges have default_factory - Required for list relationships
- Clear descriptions - Explain the relationship
- Appropriate cardinality - Single vs list chosen correctly
✅ Validation¶
- Field validators - Data quality checks where needed
- Model validators - Cross-field validation implemented
- Clear error messages - Specific, actionable errors
- Handle None values - Validators allow None for optional fields
- Pre-validators for normalization - mode='before' used appropriately
✅ String Representations¶
- str methods - Defined for entities and key components
- Handle None values - String methods don't crash on None
- Meaningful output - Useful for debugging and logging
✅ Type Hints and Consistency¶
- Proper type hints - All fields have correct types
- Consistent patterns - Similar fields use similar patterns
- No duplicate code - Reusable components extracted
Testing Your Template¶
Test 1: Basic Instantiation¶
# test_template_basic.py
from my_template import Document, Organization, Address
def test_basic_instantiation():
"""Test that models can be instantiated."""
doc = Document(
document_id="TEST-001",
issued_by=Organization(
name="Test Corp",
located_at=Address(
street="123 Test St",
city="Paris"
)
)
)
assert doc.document_id == "TEST-001"
assert doc.issued_by.name == "Test Corp"
print("✅ Basic instantiation works")
if __name__ == "__main__":
test_basic_instantiation()
Run with:
Test 2: Validation¶
# test_template_validation.py
from my_template import MonetaryAmount
import pytest
def test_positive_amount():
"""Test that negative amounts are rejected."""
with pytest.raises(ValueError, match="non-negative"):
MonetaryAmount(value=-100, currency="EUR")
print("✅ Validation works")
def test_valid_amount():
"""Test that positive amounts are accepted."""
amount = MonetaryAmount(value=100, currency="EUR")
assert amount.value == 100
print("✅ Valid data accepted")
if __name__ == "__main__":
test_positive_amount()
test_valid_amount()
Run with:
Test 3: Serialization¶
# test_template_serialization.py
from my_template import Document, Organization, Address
import json
def test_json_serialization():
"""Test that models can be serialized to JSON."""
doc = Document(
document_id="TEST-001",
issued_by=Organization(
name="Test Corp",
located_at=Address(
street="123 Test St",
city="Paris"
)
)
)
# Serialize to JSON
json_str = doc.model_dump_json(indent=2)
print("✅ JSON serialization works")
print(json_str)
# Deserialize from JSON
json_data = json.loads(json_str)
doc2 = Document(**json_data)
assert doc2.document_id == doc.document_id
print("✅ JSON deserialization works")
if __name__ == "__main__":
test_json_serialization()
Test 4: Edge Metadata¶
# test_template_edges.py
from my_template import Document
def test_edge_metadata():
"""Test that edge metadata is present."""
# Get field info
fields = Document.model_fields
# Check issued_by has edge metadata
issued_by_field = fields["issued_by"]
metadata = issued_by_field.json_schema_extra
assert metadata is not None
assert "edge_label" in metadata
assert metadata["edge_label"] == "ISSUED_BY"
print("✅ Edge metadata present")
if __name__ == "__main__":
test_edge_metadata()
Test 5: End-to-End Extraction¶
# test_template_extraction.py
"""Test template with actual extraction."""
def test_extraction():
"""Test extraction with a sample document."""
import subprocess
result = subprocess.run([
"uv", "run", "docling-graph", "convert",
"test_document.pdf",
"--template", "my_template.Document",
"--output-dir", "test_output",
"--backend", "llm",
"--inference", "local"
], capture_output=True, text=True)
assert result.returncode == 0, f"Extraction failed: {result.stderr}"
print("✅ End-to-end extraction works")
if __name__ == "__main__":
test_extraction()
Common Pitfalls to Avoid¶
❌ Wrong edge() Definition¶
# WRONG - Missing **kwargs
def edge(label: str) -> Any:
return Field(..., json_schema_extra={"edge_label": label})
# CORRECT
def edge(label: str, **kwargs: Any) -> Any:
return Field(..., json_schema_extra={"edge_label": label}, **kwargs)
❌ Missing default_factory for Lists¶
# WRONG
items: List[Item] = edge(label="CONTAINS_LINE")
# CORRECT
items: List[Item] = edge(
label="CONTAINS_LINE",
default_factory=list
)
❌ Mutable Default Values¶
# WRONG - Shared mutable object
items: List[str] = Field([])
# CORRECT
items: List[str] = Field(default_factory=list)
❌ Vague Descriptions¶
# WRONG
name: str = Field(..., description="Name")
# CORRECT
name: str = Field(
...,
description=(
"Full legal name of the organization. "
"Look for 'Company Name' or header text. "
"Include legal suffixes like 'Ltd', 'Inc'."
),
examples=["Acme Corp Ltd", "Tech Solutions Inc"]
)
❌ Inconsistent Edge Labels¶
# WRONG - Mixed formats
issued_by: Org = edge(label="issuedBy")
sent_to: Client = edge(label="SENT_TO")
has_items: List[Item] = edge(label="contains-item")
# CORRECT - Consistent ALL_CAPS_WITH_UNDERSCORES
issued_by: Org = edge(label="ISSUED_BY")
sent_to: Client = edge(label="SENT_TO")
has_items: List[Item] = edge(label="CONTAINS_LINE")
❌ Wrong Entity/Component Classification¶
# WRONG - Address as entity (creates duplicate nodes)
class Address(BaseModel):
model_config = ConfigDict(graph_id_fields=["street", "city"])
# CORRECT - Address as component (shared nodes)
class Address(BaseModel):
model_config = ConfigDict(is_entity=False)
❌ Unstable ID Fields¶
# WRONG - Email can change
class Person(BaseModel):
model_config = ConfigDict(graph_id_fields=["email"])
# CORRECT - Stable fields
class Person(BaseModel):
model_config = ConfigDict(
graph_id_fields=["first_name", "last_name", "date_of_birth"]
)
❌ Missing Validators¶
# WRONG - No validation
currency: str = Field(...)
# CORRECT - Validated
currency: str = Field(...)
@field_validator("currency")
@classmethod
def validate_currency(cls, v: Any) -> Any:
if v and not (len(v) == 3 and v.isupper()):
raise ValueError("Currency must be 3 uppercase letters")
return v
Performance Considerations¶
1. Keep Templates Focused¶
# ✅ Good - Focused template
class BillingDocument(BaseModel):
"""BillingDocument document."""
# Only invoice-related fields
# ❌ Bad - Kitchen sink template
class Document(BaseModel):
"""Generic document."""
# Hundreds of fields for every document type
2. Use Appropriate Validators¶
# ✅ Good - Simple validation
@field_validator("value")
@classmethod
def validate_positive(cls, v: Any) -> Any:
if v < 0:
raise ValueError("Must be non-negative")
return v
# ❌ Bad - Complex validation in validator
@field_validator("value")
@classmethod
def validate_complex(cls, v: Any) -> Any:
# Expensive database lookup
# Complex calculations
# Multiple API calls
return v
3. Minimize Nested Depth¶
# ✅ Good - Reasonable nesting (2-3 levels)
Invoice → LineItem → MonetaryAmount
# ❌ Bad - Excessive nesting (5+ levels)
Document → Section → Subsection → Paragraph → Sentence → Word
Maintenance Tips¶
1. Version Your Templates¶
"""
BillingDocument extraction template.
Version: 2.0.0
Last Updated: 2024-01-15
Changes:
- Added payment_terms field
- Updated Organization to include tax_id
- Fixed email validation
"""
2. Document Breaking Changes¶
"""
BREAKING CHANGES in v2.0.0:
- Renamed 'bill_no' to 'document_no'
- Changed 'date' from str to date type
- Removed deprecated 'legacy_field'
"""
3. Keep Examples Updated¶
# ✅ Good - Current examples
document_no: str = Field(
...,
description="Unique invoice identifier",
examples=["INV-2024-001", "2024-INV-12345"] # Current format
)
# ❌ Bad - Outdated examples
document_no: str = Field(
...,
description="Unique invoice identifier",
examples=["INV-2020-001", "2020-INV-12345"] # Old format
)
4. Add Migration Guides¶
"""
Migration from v1.x to v2.0:
1. Rename fields:
- bill_no → document_no
- client → sent_to
2. Update types:
- date: str → date
3. Add required fields:
- payment_terms (default: "Net 30")
"""
Template Quality Checklist¶
Before Deployment¶
- All tests pass
- Template validated with sample documents
- Edge metadata verified
- Documentation complete
- Examples realistic and current
- No TODO or FIXME comments
- Code reviewed by team
- Performance tested with large documents
After Deployment¶
- Monitor extraction quality
- Collect feedback from users
- Track common extraction errors
- Update examples based on real data
- Refine descriptions based on LLM performance
- Version and document changes
Quick Start Template¶
Use this as a starting point for new templates:
"""
[Template Name] extraction template.
Extracts [key information] from [document type] documents.
Version: 1.0.0
Last Updated: [Date]
"""
from typing import Any, List, Optional
from pydantic import BaseModel, ConfigDict, Field, field_validator
# --- Edge Helper Function (REQUIRED) ---
def edge(label: str, **kwargs: Any) -> Any:
"""Helper to create graph edges."""
return Field(..., json_schema_extra={"edge_label": label}, **kwargs)
# --- Components ---
class Address(BaseModel):
"""Physical address component."""
model_config = ConfigDict(is_entity=False)
street: str = Field(
description="Street name and number",
examples=["123 Main St", "45 Rue de la Paix"]
)
city: str = Field(
description="City name",
examples=["Paris", "London"]
)
# --- Entities ---
class Organization(BaseModel):
"""Organization entity."""
model_config = ConfigDict(graph_id_fields=["name"])
name: str = Field(
description="Legal organization name",
examples=["Acme Corp", "Tech Solutions Ltd"]
)
located_at: Address = edge(
label="LOCATED_AT",
description="Organization's physical address"
)
# --- Root Document ---
class [DocumentName](BaseModel):
"""[Document type] document."""
model_config = ConfigDict(graph_id_fields=["document_id"])
document_id: str = Field(
description="Unique document identifier",
examples=["DOC-2024-001", "12345"]
)
issued_by: Organization = edge(
label="ISSUED_BY",
description="Organization that issued this document"
)
Next Steps¶
Congratulations! You've completed the Schema Definition guide. Now:
- Pipeline Configuration → - Configure extraction settings
- Examples - See complete working templates
- Extraction Process - Understand the extraction pipeline
Additional Resources¶
Documentation¶
- Pydantic Documentation - Official Pydantic docs
- Template Examples - Production-ready templates
- API Reference - Complete API documentation
Example Templates¶
- Invoice Template -
docs/examples/templates/billing_document.py - ID Card Template -
docs/examples/templates/id_card.py - Rheology Research Template -
docs/examples/templates/rheology_research.py - Insurance Template -
docs/examples/templates/insurance.py
Community¶
- GitHub Issues - Report bugs or request features
- Discussions - Ask questions and share templates
Final Checklist¶
Before moving to Pipeline Configuration, ensure:
- Template structure follows best practices
- All entities have appropriate
graph_id_fields - All components have
is_entity=False - Edge labels are consistent and descriptive
- Field descriptions are LLM-friendly
- Examples are realistic and diverse
- Validators ensure data quality
- Tests pass successfully
- Template tested with sample documents