Best Practices and Checklist¶

Overview¶

This guide provides a comprehensive checklist and best practices for creating high-quality Pydantic templates. Use this as a final review before deploying your template.

In this guide: - Complete template checklist - Testing your template - Common pitfalls to avoid - Performance considerations - Maintenance tips

Complete Template Checklist¶

✅ Structure and Organization¶

Module docstring - Clear description of template purpose
Standard imports - All necessary imports included
Edge helper function - Defined correctly with exact signature
Logical organization - Components → Entities → Domain Models → Root
Model docstrings - All models have clear docstrings
Consistent naming - Follow Python naming conventions

✅ Entity Configuration¶

graph_id_fields defined - All entities have appropriate ID fields
ID fields are stable - Won't change frequently
ID fields are likely present - Will be extracted from documents
Composite IDs make sense - Multiple fields form natural unique identifier

✅ Component Configuration¶

is_entity=False set - All components marked correctly
Appropriate for sharing - Components represent value objects
Content-based deduplication - All fields used for uniqueness

✅ Field Definitions¶

Clear descriptions - LLM-friendly with extraction hints
Realistic examples - 2-5 diverse examples per field
Proper type hints - Optional, List, Union used correctly
Appropriate defaults - Required (...), None, or meaningful defaults
List fields use default_factory - Never use [] as default

✅ Edge Definitions¶

Descriptive labels - ALL_CAPS_WITH_UNDERSCORES format
Consistent naming - Same pattern across template
List edges have default_factory - Required for list relationships
Clear descriptions - Explain the relationship
Appropriate cardinality - Single vs list chosen correctly

✅ Validation¶

Field validators - Data quality checks where needed
Model validators - Cross-field validation implemented
Clear error messages - Specific, actionable errors
Handle None values - Validators allow None for optional fields
Pre-validators for normalization - mode='before' used appropriately

✅ String Representations¶

str methods - Defined for entities and key components
Handle None values - String methods don't crash on None
Meaningful output - Useful for debugging and logging

✅ Type Hints and Consistency¶

Proper type hints - All fields have correct types
Consistent patterns - Similar fields use similar patterns
No duplicate code - Reusable components extracted

Testing Your Template¶

Test 1: Basic Instantiation¶

# test_template_basic.py
from my_template import Document, Organization, Address

def test_basic_instantiation():
    """Test that models can be instantiated."""
    doc = Document(
        document_id="TEST-001",
        issued_by=Organization(
            name="Test Corp",
            located_at=Address(
                street="123 Test St",
                city="Paris"
            )
        )
    )
    assert doc.document_id == "TEST-001"
    assert doc.issued_by.name == "Test Corp"
    print("✅ Basic instantiation works")

if __name__ == "__main__":
    test_basic_instantiation()

Run with:

uv run python test_template_basic.py

Test 2: Validation¶

# test_template_validation.py
from my_template import MonetaryAmount
import pytest

def test_positive_amount():
    """Test that negative amounts are rejected."""
    with pytest.raises(ValueError, match="non-negative"):
        MonetaryAmount(value=-100, currency="EUR")
    print("✅ Validation works")

def test_valid_amount():
    """Test that positive amounts are accepted."""
    amount = MonetaryAmount(value=100, currency="EUR")
    assert amount.value == 100
    print("✅ Valid data accepted")

if __name__ == "__main__":
    test_positive_amount()
    test_valid_amount()

Run with:

uv run pytest test_template_validation.py -v

Test 3: Serialization¶

# test_template_serialization.py
from my_template import Document, Organization, Address
import json

def test_json_serialization():
    """Test that models can be serialized to JSON."""
    doc = Document(
        document_id="TEST-001",
        issued_by=Organization(
            name="Test Corp",
            located_at=Address(
                street="123 Test St",
                city="Paris"
            )
        )
    )

    # Serialize to JSON
    json_str = doc.model_dump_json(indent=2)
    print("✅ JSON serialization works")
    print(json_str)

    # Deserialize from JSON
    json_data = json.loads(json_str)
    doc2 = Document(**json_data)
    assert doc2.document_id == doc.document_id
    print("✅ JSON deserialization works")

if __name__ == "__main__":
    test_json_serialization()

Test 4: Edge Metadata¶

# test_template_edges.py
from my_template import Document

def test_edge_metadata():
    """Test that edge metadata is present."""
    # Get field info
    fields = Document.model_fields

    # Check issued_by has edge metadata
    issued_by_field = fields["issued_by"]
    metadata = issued_by_field.json_schema_extra

    assert metadata is not None
    assert "edge_label" in metadata
    assert metadata["edge_label"] == "ISSUED_BY"
    print("✅ Edge metadata present")

if __name__ == "__main__":
    test_edge_metadata()

Test 5: End-to-End Extraction¶

# test_template_extraction.py
"""Test template with actual extraction."""

def test_extraction():
    """Test extraction with a sample document."""
    import subprocess

    result = subprocess.run([
        "uv", "run", "docling-graph", "convert",
        "test_document.pdf",
        "--template", "my_template.Document",
        "--output-dir", "test_output",
        "--backend", "llm",
        "--inference", "local"
    ], capture_output=True, text=True)

    assert result.returncode == 0, f"Extraction failed: {result.stderr}"
    print("✅ End-to-end extraction works")

if __name__ == "__main__":
    test_extraction()

Common Pitfalls to Avoid¶

❌ Wrong edge() Definition¶

# WRONG - Missing **kwargs
def edge(label: str) -> Any:
    return Field(..., json_schema_extra={"edge_label": label})

# CORRECT
def edge(label: str, **kwargs: Any) -> Any:
    return Field(..., json_schema_extra={"edge_label": label}, **kwargs)

❌ Missing default_factory for Lists¶

# WRONG
items: List[Item] = edge(label="CONTAINS_LINE")

# CORRECT
items: List[Item] = edge(
    label="CONTAINS_LINE",
    default_factory=list
)

❌ Mutable Default Values¶

# WRONG - Shared mutable object
items: List[str] = Field([])

# CORRECT
items: List[str] = Field(default_factory=list)

❌ Vague Descriptions¶

# WRONG
name: str = Field(..., description="Name")

# CORRECT
name: str = Field(
    ...,
    description=(
        "Full legal name of the organization. "
        "Look for 'Company Name' or header text. "
        "Include legal suffixes like 'Ltd', 'Inc'."
    ),
    examples=["Acme Corp Ltd", "Tech Solutions Inc"]
)

❌ Inconsistent Edge Labels¶

# WRONG - Mixed formats
issued_by: Org = edge(label="issuedBy")
sent_to: Client = edge(label="SENT_TO")
has_items: List[Item] = edge(label="contains-item")

# CORRECT - Consistent ALL_CAPS_WITH_UNDERSCORES
issued_by: Org = edge(label="ISSUED_BY")
sent_to: Client = edge(label="SENT_TO")
has_items: List[Item] = edge(label="CONTAINS_LINE")

❌ Wrong Entity/Component Classification¶

# WRONG - Address as entity (creates duplicate nodes)
class Address(BaseModel):
    model_config = ConfigDict(graph_id_fields=["street", "city"])

# CORRECT - Address as component (shared nodes)
class Address(BaseModel):
    model_config = ConfigDict(is_entity=False)

❌ Unstable ID Fields¶

# WRONG - Email can change
class Person(BaseModel):
    model_config = ConfigDict(graph_id_fields=["email"])

# CORRECT - Stable fields
class Person(BaseModel):
    model_config = ConfigDict(
        graph_id_fields=["first_name", "last_name", "date_of_birth"]
    )

❌ Missing Validators¶

# WRONG - No validation
currency: str = Field(...)

# CORRECT - Validated
currency: str = Field(...)

@field_validator("currency")
@classmethod
def validate_currency(cls, v: Any) -> Any:
    if v and not (len(v) == 3 and v.isupper()):
        raise ValueError("Currency must be 3 uppercase letters")
    return v

Performance Considerations¶

1. Keep Templates Focused¶

# ✅ Good - Focused template
class BillingDocument(BaseModel):
    """BillingDocument document."""
    # Only invoice-related fields

# ❌ Bad - Kitchen sink template
class Document(BaseModel):
    """Generic document."""
    # Hundreds of fields for every document type

2. Use Appropriate Validators¶

# ✅ Good - Simple validation
@field_validator("value")
@classmethod
def validate_positive(cls, v: Any) -> Any:
    if v < 0:
        raise ValueError("Must be non-negative")
    return v

# ❌ Bad - Complex validation in validator
@field_validator("value")
@classmethod
def validate_complex(cls, v: Any) -> Any:
    # Expensive database lookup
    # Complex calculations
    # Multiple API calls
    return v

3. Minimize Nested Depth¶

# ✅ Good - Reasonable nesting (2-3 levels)
Invoice → LineItem → MonetaryAmount

# ❌ Bad - Excessive nesting (5+ levels)
Document → Section → Subsection → Paragraph → Sentence → Word

Maintenance Tips¶

1. Version Your Templates¶

"""
BillingDocument extraction template.

Version: 2.0.0
Last Updated: 2024-01-15
Changes:
  - Added payment_terms field
  - Updated Organization to include tax_id
  - Fixed email validation
"""

2. Document Breaking Changes¶

"""
BREAKING CHANGES in v2.0.0:
- Renamed 'bill_no' to 'document_no'
- Changed 'date' from str to date type
- Removed deprecated 'legacy_field'
"""

3. Keep Examples Updated¶

# ✅ Good - Current examples
document_no: str = Field(
    ...,
    description="Unique invoice identifier",
    examples=["INV-2024-001", "2024-INV-12345"]  # Current format
)

# ❌ Bad - Outdated examples
document_no: str = Field(
    ...,
    description="Unique invoice identifier",
    examples=["INV-2020-001", "2020-INV-12345"]  # Old format
)

4. Add Migration Guides¶

"""
Migration from v1.x to v2.0:

1. Rename fields:
   - bill_no → document_no
   - client → sent_to

2. Update types:
   - date: str → date

3. Add required fields:
   - payment_terms (default: "Net 30")
"""

Template Quality Checklist¶

Before Deployment¶

After Deployment¶

Monitor extraction quality
Collect feedback from users
Track common extraction errors
Update examples based on real data
Refine descriptions based on LLM performance
Version and document changes

Quick Start Template¶

Use this as a starting point for new templates:

"""
[Template Name] extraction template.

Extracts [key information] from [document type] documents.

Version: 1.0.0
Last Updated: [Date]
"""

from typing import Any, List, Optional
from pydantic import BaseModel, ConfigDict, Field, field_validator

# --- Edge Helper Function (REQUIRED) ---
def edge(label: str, **kwargs: Any) -> Any:
    """Helper to create graph edges."""
    return Field(..., json_schema_extra={"edge_label": label}, **kwargs)

# --- Components ---
class Address(BaseModel):
    """Physical address component."""
    model_config = ConfigDict(is_entity=False)

    street: str = Field(
        description="Street name and number",
        examples=["123 Main St", "45 Rue de la Paix"]
    )
    city: str = Field(
        description="City name",
        examples=["Paris", "London"]
    )

# --- Entities ---
class Organization(BaseModel):
    """Organization entity."""
    model_config = ConfigDict(graph_id_fields=["name"])

    name: str = Field(
        description="Legal organization name",
        examples=["Acme Corp", "Tech Solutions Ltd"]
    )

    located_at: Address = edge(
        label="LOCATED_AT",
        description="Organization's physical address"
    )

# --- Root Document ---
class [DocumentName](BaseModel):
    """[Document type] document."""
    model_config = ConfigDict(graph_id_fields=["document_id"])

    document_id: str = Field(
        description="Unique document identifier",
        examples=["DOC-2024-001", "12345"]
    )

    issued_by: Organization = edge(
        label="ISSUED_BY",
        description="Organization that issued this document"
    )

Next Steps¶

Congratulations! You've completed the Schema Definition guide. Now:

Pipeline Configuration → - Configure extraction settings
Examples - See complete working templates
Extraction Process - Understand the extraction pipeline

Additional Resources¶

Documentation¶

Pydantic Documentation - Official Pydantic docs
Template Examples - Production-ready templates
API Reference - Complete API documentation

Example Templates¶

Invoice Template - docs/examples/templates/billing_document.py
ID Card Template - docs/examples/templates/id_card.py
Rheology Research Template - docs/examples/templates/rheology_research.py
Insurance Template - docs/examples/templates/insurance.py

Community¶

GitHub Issues - Report bugs or request features
Discussions - Ask questions and share templates

Final Checklist¶

Before moving to Pipeline Configuration, ensure:

Template structure follows best practices
All entities have appropriate graph_id_fields
All components have is_entity=False
Edge labels are consistent and descriptive
Field descriptions are LLM-friendly
Examples are realistic and diverse
Validators ensure data quality
Tests pass successfully
Template tested with sample documents