Validation and Normalization¶

Overview¶

Validators ensure data quality and consistency in your extracted data. Pydantic provides powerful validation mechanisms that can transform, normalize, and validate field values before they're stored in your knowledge graph.

In this guide: - Field validators for single-field validation - Model validators for cross-field validation - Pre-validators for data transformation - Common validation patterns - Normalization helpers

Field Validators¶

Basic Field Validator¶

Use @field_validator to validate individual fields:

from pydantic import BaseModel, Field, field_validator
from typing import Any

class MonetaryAmount(BaseModel):
    """Monetary value with validation."""
    model_config = ConfigDict(is_entity=False)

    value: float = Field(...)
    currency: Optional[str] = Field(None)

    @field_validator("value")
    @classmethod
    def validate_positive(cls, v: Any) -> Any:
        """Ensure value is non-negative."""
        if v < 0:
            raise ValueError("Monetary amount must be non-negative")
        return v

Validator Anatomy¶

@field_validator("field_name")  # Field to validate
@classmethod  # Must be classmethod
def validator_name(cls, v: Any) -> Any:  # Takes value, returns value
    """Docstring explaining validation."""
    # Validation logic
    if not valid:
        raise ValueError("Error message")
    return v  # Return (possibly modified) value

Pre-Validators (mode='before')¶

When to Use Pre-Validators¶

Use mode='before' to transform input before type coercion:

@field_validator("email", mode="before")
@classmethod
def normalize_email(cls, v: Any) -> Any:
    """Convert email to lowercase and strip whitespace."""
    if v:
        return v.lower().strip()
    return v

Use cases: - Normalizing strings (lowercase, strip whitespace) - Converting types (string to list) - Parsing complex formats - Cleaning input data

Pre-Validator Examples¶

📍 Email Normalization¶

class Person(BaseModel):
    """Person with normalized email."""

    email: Optional[str] = Field(None)

    @field_validator("email", mode="before")
    @classmethod
    def normalize_email(cls, v: Any) -> Any:
        """Convert email to lowercase and strip whitespace."""
        if v:
            return v.lower().strip()
        return v

Input/Output:

Person(email="  John.Doe@EMAIL.COM  ")
# Result: email="john.doe@email.com"

📍 String to List Conversion¶

class Person(BaseModel):
    """Person with flexible name input."""

    given_names: List[str] = Field(default_factory=list)

    @field_validator("given_names", mode="before")
    @classmethod
    def ensure_list(cls, v: Any) -> Any:
        """Ensure given_names is always a list."""
        if isinstance(v, str):
            # Handle comma-separated names
            if "," in v:
                return [name.strip() for name in v.split(",")]
            return [v]
        return v

Input/Output:

Person(given_names="John, Paul, George")
# Result: given_names=["John", "Paul", "George"]

Person(given_names="John")
# Result: given_names=["John"]

Person(given_names=["John", "Paul"])
# Result: given_names=["John", "Paul"]

📍 Phone Number Cleaning¶

class Contact(BaseModel):
    """Contact with cleaned phone number."""

    phone: Optional[str] = Field(None)

    @field_validator("phone", mode="before")
    @classmethod
    def clean_phone(cls, v: Any) -> Any:
        """Remove non-numeric characters except + and spaces."""
        if v:
            # Keep only digits, +, and spaces
            import re
            return re.sub(r'[^\d\s+]', '', v)
        return v

Input/Output:

Contact(phone="+33 (0)1-23-45-67-89")
# Result: phone="+33 01 23 45 67 89"

Post-Validators (Default Mode)¶

When to Use Post-Validators¶

Use default mode (or mode='after') to validate after type coercion:

@field_validator("currency")
@classmethod
def validate_currency_format(cls, v: Any) -> Any:
    """Ensure currency is 3 uppercase letters (ISO 4217)."""
    if v and not (len(v) == 3 and v.isupper()):
        raise ValueError("Currency must be 3 uppercase letters (ISO 4217)")
    return v

Use cases: - Validating format constraints - Checking value ranges - Enforcing business rules - Verifying data integrity

Post-Validator Examples¶

📍 Currency Code Validation¶

class MonetaryAmount(BaseModel):
    """Monetary amount with validated currency."""
    model_config = ConfigDict(is_entity=False)

    value: float = Field(...)
    currency: Optional[str] = Field(None)

    @field_validator("currency")
    @classmethod
    def validate_currency_format(cls, v: Any) -> Any:
        """Ensure currency is 3 uppercase letters."""
        if v and not (len(v) == 3 and v.isupper()):
            raise ValueError("Currency must be 3 uppercase letters (ISO 4217)")
        return v

📍 Range Validation¶

class Product(BaseModel):
    """Product with validated quantity."""

    quantity: int = Field(...)

    @field_validator("quantity")
    @classmethod
    def validate_quantity_range(cls, v: Any) -> Any:
        """Ensure quantity is between 1 and 10000."""
        if v < 1:
            raise ValueError("Quantity must be at least 1")
        if v > 10000:
            raise ValueError("Quantity cannot exceed 10000")
        return v

📍 Email Format Validation¶

class Contact(BaseModel):
    """Contact with validated email."""

    email: Optional[str] = Field(None)

    @field_validator("email")
    @classmethod
    def validate_email_format(cls, v: Any) -> Any:
        """Basic email format validation."""
        if v and "@" not in v:
            raise ValueError("Invalid email format")
        return v

Model Validators¶

When to Use Model Validators¶

Use @model_validator for cross-field validation - when validation depends on multiple fields:

from pydantic import model_validator
from typing_extensions import Self

class Measurement(BaseModel):
    """Measurement with cross-field validation."""
    model_config = ConfigDict(is_entity=False)

    numeric_value: Optional[float] = Field(None)
    numeric_value_min: Optional[float] = Field(None)
    numeric_value_max: Optional[float] = Field(None)

    @model_validator(mode="after")
    def validate_value_consistency(self) -> Self:
        """Ensure value fields are used consistently."""
        has_single = self.numeric_value is not None
        has_min = self.numeric_value_min is not None
        has_max = self.numeric_value_max is not None

        if has_single and has_min and has_max:
            raise ValueError(
                "Cannot specify numeric_value, numeric_value_min, "
                "and numeric_value_max simultaneously"
            )

        return self

Model Validator Examples¶

📍 Date Range Validation¶

from datetime import date

class Event(BaseModel):
    """Event with validated date range."""

    start_date: Optional[date] = Field(None)
    end_date: Optional[date] = Field(None)

    @model_validator(mode="after")
    def validate_date_range(self) -> Self:
        """Ensure end_date is after start_date."""
        if self.start_date and self.end_date:
            if self.end_date < self.start_date:
                raise ValueError("end_date must be after start_date")
        return self

📍 Conditional Required Fields¶

class Document(BaseModel):
    """Document with conditional validation."""

    document_type: str = Field(...)
    document_no: Optional[str] = Field(None)
    receipt_number: Optional[str] = Field(None)

    @model_validator(mode="after")
    def validate_document_numbers(self) -> Self:
        """Ensure appropriate number field is present."""
        if self.document_type == "invoice" and not self.document_no:
            raise ValueError("document_no required for invoice documents")
        if self.document_type == "receipt" and not self.receipt_number:
            raise ValueError("receipt_number required for receipt documents")
        return self

📍 Mutual Exclusivity¶

class Payment(BaseModel):
    """Payment with mutually exclusive fields."""

    cash_amount: Optional[float] = Field(None)
    card_amount: Optional[float] = Field(None)
    check_amount: Optional[float] = Field(None)

    @model_validator(mode="after")
    def validate_single_payment_method(self) -> Self:
        """Ensure only one payment method is used."""
        methods = [
            self.cash_amount is not None,
            self.card_amount is not None,
            self.check_amount is not None
        ]
        if sum(methods) > 1:
            raise ValueError("Only one payment method can be specified")
        if sum(methods) == 0:
            raise ValueError("At least one payment method must be specified")
        return self

Common Validation Patterns¶

Pattern 1: Positive Number Validation¶

@field_validator("amount", "quantity", "price")
@classmethod
def validate_positive(cls, v: Any) -> Any:
    """Ensure value is positive."""
    if v is not None and v < 0:
        raise ValueError(f"Value must be non-negative, got {v}")
    return v

Pattern 2: String Length Validation¶

@field_validator("postal_code")
@classmethod
def validate_postal_code_length(cls, v: Any) -> Any:
    """Ensure postal code is 5 digits."""
    if v and len(v) != 5:
        raise ValueError("Postal code must be 5 digits")
    return v

Pattern 3: Enum-like Validation¶

@field_validator("status")
@classmethod
def validate_status(cls, v: Any) -> Any:
    """Ensure status is one of allowed values."""
    allowed = ["pending", "approved", "rejected"]
    if v and v not in allowed:
        raise ValueError(f"Status must be one of {allowed}")
    return v

Pattern 4: Pattern Matching¶

import re

@field_validator("email")
@classmethod
def validate_email_pattern(cls, v: Any) -> Any:
    """Validate email format using regex."""
    if v:
        pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
        if not re.match(pattern, v):
            raise ValueError("Invalid email format")
    return v

Enum Normalization Helper¶

The Problem¶

Enums can be tricky with LLM extraction - the model might return various formats:

from enum import Enum

class Status(str, Enum):
    PENDING = "Pending"
    APPROVED = "Approved"
    REJECTED = "Rejected"

# LLM might return: "pending", "PENDING", "Pending", "approved", etc.

The Solution¶

Use a normalization helper:

import re
from enum import Enum
from typing import Type, Any

def _normalize_enum(enum_cls: Type[Enum], v: Any) -> Any:
    """
    Accept enum instances, value strings, or member names.
    Handles various formats: 'VALUE', 'value', 'Value', 'VALUE_NAME'.
    Falls back to 'OTHER' member if present.
    """
    if isinstance(v, enum_cls):
        return v

    if isinstance(v, str):
        # Normalize to alphanumeric lowercase
        key = re.sub(r"[^A-Za-z0-9]+", "", v).lower()

        # Build mapping of normalized names/values to enum members
        mapping = {}
        for member in enum_cls:
            normalized_name = re.sub(r"[^A-Za-z0-9]+", "", member.name).lower()
            normalized_value = re.sub(r"[^A-Za-z0-9]+", "", member.value).lower()
            mapping[normalized_name] = member
            mapping[normalized_value] = member

        if key in mapping:
            return mapping[key]

        # Last attempt: direct value match
        try:
            return enum_cls(v)
        except Exception:
            # Safe fallback to OTHER if present
            if "OTHER" in enum_cls.__members__:
                return enum_cls.OTHER
            raise

    raise ValueError(f"Cannot normalize {v} to {enum_cls}")

Usage Example¶

class DocumentType(str, Enum):
    INVOICE = "Invoice"
    RECEIPT = "Receipt"
    CREDIT_NOTE = "Credit Note"
    DEBIT_NOTE = "Debit Note"
    PRO_FORMA = "Pro Forma"
    OTHER = "Other"

class Document(BaseModel):
    """Document with normalized enum."""

    document_type: DocumentType = Field(...)

    @field_validator("document_type", mode="before")
    @classmethod
    def normalize_document_type(cls, v: Any) -> Any:
        return _normalize_enum(DocumentType, v)

Handles all these inputs:

Document(document_type="invoice")  # → DocumentType.INVOICE
Document(document_type="INVOICE")  # → DocumentType.INVOICE
Document(document_type="Invoice")  # → DocumentType.INVOICE
Document(document_type="credit note")  # → DocumentType.CREDIT_NOTE
Document(document_type="unknown")  # → DocumentType.OTHER (fallback)

Measurement Parsing Helper¶

The Problem¶

LLMs might return measurements in various formats:

"1.6 mPa.s"
"2 mm"
"80-90 °C"
"High"

The Solution¶

Use a parsing helper:

import re
from typing import Any, Optional

def _parse_measurement_string(
    s: str,
    default_name: Optional[str] = None,
    strict: bool = False
) -> dict[str, Any]:
    """
    Parse measurement strings into structured dict.

    Examples:
        "1.6 mPa.s" → {numeric_value: 1.6, unit: "mPa.s"}
        "80-90 °C" → {numeric_value_min: 80, numeric_value_max: 90, unit: "°C"}
        "High" → {text_value: "High"}
    """
    if not isinstance(s, str):
        return s

    # Try to parse range (e.g., "80-90 °C")
    range_match = re.match(
        r"^\s*([+-]?\d+(?:\.\d+)?)\s*-\s*([+-]?\d+(?:\.\d+)?)\s*([^\d]+)?$",
        s
    )
    if range_match:
        min_val = float(range_match.group(1))
        max_val = float(range_match.group(2))
        unit = (range_match.group(3) or "").strip() or None
        return {
            "name": default_name or "Value",
            "numeric_value": None,
            "numeric_value_min": min_val,
            "numeric_value_max": max_val,
            "text_value": None,
            "unit": unit,
        }

    # Try to parse single value (e.g., "1.6 mPa.s")
    single_match = re.match(r"^\s*([+-]?\d+(?:\.\d+)?)\s*([^\d]+)?$", s)
    if single_match:
        num = float(single_match.group(1))
        unit = (single_match.group(2) or "").strip() or None
        return {
            "name": default_name or "Value",
            "numeric_value": num,
            "numeric_value_min": None,
            "numeric_value_max": None,
            "text_value": None,
            "unit": unit,
        }

    # No numeric part found
    if strict:
        raise ValueError(f"Cannot parse '{s}' as measurement")

    # Fallback: keep raw as text
    return {
        "name": default_name or "Value",
        "numeric_value": None,
        "numeric_value_min": None,
        "numeric_value_max": None,
        "text_value": s.strip(),
        "unit": None,
    }

Usage Example¶

class Measurement(BaseModel):
    """Flexible measurement model."""
    model_config = ConfigDict(is_entity=False)

    name: str = Field(...)
    numeric_value: Optional[float] = Field(None)
    numeric_value_min: Optional[float] = Field(None)
    numeric_value_max: Optional[float] = Field(None)
    text_value: Optional[str] = Field(None)
    unit: Optional[str] = Field(None)

    @field_validator("numeric_value", "numeric_value_min", "numeric_value_max", mode="before")
    @classmethod
    def parse_if_string(cls, v: Any, info: ValidationInfo) -> Any:
        """Parse measurement strings."""
        if isinstance(v, str):
            field_name = info.field_name
            parsed = _parse_measurement_string(v, default_name=field_name)
            return parsed.get(field_name)
        return v

Best Practices¶

👍 Validate Early¶

Use mode='before' for normalization, default mode for validation:

@field_validator("email", mode="before")
@classmethod
def normalize_email(cls, v: Any) -> Any:
    """Normalize before validation."""
    if v:
        return v.lower().strip()
    return v

@field_validator("email")
@classmethod
def validate_email(cls, v: Any) -> Any:
    """Validate after normalization."""
    if v and "@" not in v:
        raise ValueError("Invalid email")
    return v

👍 Provide Clear Error Messages¶

# ✅ Good - Specific error message
@field_validator("quantity")
@classmethod
def validate_quantity(cls, v: Any) -> Any:
    if v < 1:
        raise ValueError(f"Quantity must be at least 1, got {v}")
    return v

# ❌ Bad - Vague error message
@field_validator("quantity")
@classmethod
def validate_quantity(cls, v: Any) -> Any:
    if v < 1:
        raise ValueError("Invalid quantity")
    return v

👍 Handle None Values¶

@field_validator("email")
@classmethod
def validate_email(cls, v: Any) -> Any:
    """Validate email, allowing None."""
    if v is None:
        return v  # Allow None for optional fields
    if "@" not in v:
        raise ValueError("Invalid email")
    return v

👍 Use Type Guards¶

@field_validator("value", mode="before")
@classmethod
def coerce_to_float(cls, v: Any) -> Any:
    """Convert string to float if needed."""
    if isinstance(v, str):
        try:
            return float(v.replace(",", ""))
        except ValueError:
            raise ValueError(f"Cannot convert '{v}' to float")
    return v

Graceful Error Handling¶

The Problem with Strict Validators¶

Strict validators that raise ValueError on invalid data can cause complete extraction failure:

# ❌ Strict validator - causes extraction failure
@field_validator("value")
@classmethod
def validate_positive(cls, v: Any) -> Any:
    """Ensure amount is non-negative."""
    if v < 0:
        raise ValueError(f"Monetary amount must be non-negative, got {v}")
    return v

What happens: - LLM extracts: allowance_total: -258.12 (negative because it's a discount) - Validator rejects: "Monetary amount must be non-negative" - Result: Entire extraction fails, losing ALL extracted data

The Solution: Lenient Validators¶

Lenient validators coerce invalid values instead of rejecting them:

# ✅ Lenient validator - coerces instead of rejecting
import logging

logger = logging.getLogger(__name__)

@field_validator("value", mode="before")
@classmethod
def coerce_positive(cls, v: Any) -> Any:
    """
    Coerce negative values to positive (use absolute value).

    Allowances and discounts are often represented as negative in accounting,
    but should be stored as positive amounts. The charge_indicator field
    (in AllowanceCharge) indicates direction: false=allowance, true=charge.

    This validator is lenient - it coerces instead of rejecting to prevent
    extraction failures due to semantic differences in how amounts are represented.
    """
    if isinstance(v, (int, float)) and v < 0:
        logger.warning(
            f"Negative monetary value {v} coerced to positive {abs(v)}. "
            "Allowances/discounts should be positive amounts."
        )
        return abs(v)
    return v

Benefits: - ✅ Extraction succeeds even with "invalid" data - ✅ Data quality issues are logged for review - ✅ 99% correct data is preserved instead of lost - ✅ Semantic differences are handled gracefully

When to Use Lenient Validators¶

Use lenient validators for:

Semantic Variations
Negative amounts for discounts/allowances
Lowercase currency codes (normalize to uppercase)
Different date formats (parse and normalize)
Common LLM Mistakes
Missing spaces in addresses
Wrong case in enums
Currency symbols instead of codes
Non-Critical Validation
Format preferences (3-letter currency codes)
Range constraints (quantity > 0)
Pattern matching (email format)

When to Use Strict Validators¶

Use strict validators only for:

Critical Data Integrity
Required fields that must be present
Type safety (must be a number, not a string)
Business rules that cannot be violated
Security Concerns
SQL injection prevention
Path traversal prevention
XSS prevention

Lenient Validator Patterns¶

Pattern 1: Coerce Negative to Positive¶

@field_validator("value", mode="before")
@classmethod
def coerce_positive(cls, v: Any) -> Any:
    """Coerce negative values to positive."""
    if isinstance(v, (int, float)) and v < 0:
        logger.warning(f"Negative value {v} coerced to {abs(v)}")
        return abs(v)
    return v

Pattern 2: Normalize Case¶

@field_validator("currency", mode="before")
@classmethod
def normalize_currency(cls, v: Any) -> Any:
    """Normalize currency to uppercase."""
    if v:
        v_upper = str(v).strip().upper()
        if len(v_upper) == 3 and v_upper.isalpha():
            return v_upper
        logger.warning(f"Currency '{v}' normalized to '{v_upper}'")
        return v_upper
    return v

Pattern 3: Handle Zero Values¶

@field_validator("quantity", mode="before")
@classmethod
def handle_zero(cls, v: Any) -> Any:
    """Handle zero quantities by setting default."""
    if isinstance(v, (int, float)):
        if v == 0:
            logger.warning("Zero quantity detected, setting to 1 as default")
            return 1.0
        elif v < 0:
            logger.warning(f"Negative quantity {v} coerced to {abs(v)}")
            return abs(v)
    return v

Pattern 4: Symbol to Code Conversion¶

@field_validator("currency", mode="before")
@classmethod
def convert_symbol(cls, v: Any) -> Any:
    """Convert currency symbols to ISO codes."""
    symbol_map = {
        "€": "EUR",
        "$": "USD",
        "£": "GBP",
        "¥": "JPY",
    }

    if v in symbol_map:
        logger.info(f"Currency symbol '{v}' converted to '{symbol_map[v]}'")
        return symbol_map[v]

    return v

Logging Best Practices¶

Always log data quality issues:

import logging

logger = logging.getLogger(__name__)

@field_validator("value", mode="before")
@classmethod
def coerce_positive(cls, v: Any) -> Any:
    """Coerce with logging."""
    if isinstance(v, (int, float)) and v < 0:
        # Log at WARNING level for data quality issues
        logger.warning(
            f"Data quality issue: Negative value {v} coerced to {abs(v)}. "
            f"Field: {cls.__name__}.value"
        )
        return abs(v)
    return v

Log Levels: - logger.info() - Normal coercion (e.g., lowercase → uppercase) - logger.warning() - Data quality issues (e.g., negative → positive) - logger.error() - Serious issues that couldn't be fixed

Complete Example: Lenient MonetaryAmount¶

import logging
from typing import Any
from pydantic import BaseModel, ConfigDict, Field, field_validator

logger = logging.getLogger(__name__)

class MonetaryAmount(BaseModel):
    """Monetary amount with lenient validation."""
    model_config = ConfigDict(is_entity=False)

    value: float = Field(
        ...,
        description="Monetary amount (always positive)",
        examples=[100.00, 1250.50, 89.99]
    )

    currency: str | None = Field(
        None,
        description="ISO 4217 currency code (3 uppercase letters)",
        examples=["EUR", "USD", "GBP", "CHF"]
    )

    @field_validator("value", mode="before")
    @classmethod
    def coerce_positive(cls, v: Any) -> Any:
        """Coerce negative values to positive."""
        if isinstance(v, (int, float)) and v < 0:
            logger.warning(
                f"Negative monetary value {v} coerced to positive {abs(v)}. "
                "Allowances/discounts should be positive amounts."
            )
            return abs(v)
        return v

    @field_validator("currency", mode="before")
    @classmethod
    def normalize_currency(cls, v: Any) -> Any:
        """Normalize currency to ISO 4217 format."""
        if not v:
            return v

        # Symbol to code mapping
        symbol_map = {
            "€": "EUR", "$": "USD", "£": "GBP", "¥": "JPY",
            "₹": "INR", "₽": "RUB", "₩": "KRW", "₪": "ILS",
        }

        v_str = str(v).strip()

        # Convert symbol to code
        if v_str in symbol_map:
            return symbol_map[v_str]

        # Normalize to uppercase
        v_upper = v_str.upper()

        # Validate format
        if len(v_upper) == 3 and v_upper.isalpha():
            return v_upper

        # Log warning but don't fail
        logger.warning(
            f"Currency '{v}' does not match ISO 4217 format. "
            f"Normalized to '{v_upper}' but may be invalid."
        )
        return v_upper if len(v_upper) == 3 else v_str

Migration Guide: Strict → Lenient¶

Before (Strict):

@field_validator("value")
@classmethod
def validate_positive(cls, v: Any) -> Any:
    if v < 0:
        raise ValueError(f"Must be non-negative, got {v}")
    return v

After (Lenient):

@field_validator("value", mode="before")
@classmethod
def coerce_positive(cls, v: Any) -> Any:
    if isinstance(v, (int, float)) and v < 0:
        logger.warning(f"Negative value {v} coerced to {abs(v)}")
        return abs(v)
    return v

Changes: 1. Add mode="before" to validator decorator 2. Replace raise ValueError with coercion logic 3. Add logger.warning() for data quality tracking 4. Add type guard (isinstance) for safety 5. Update docstring to explain lenient behavior

Testing Validators¶

Test Individual Validators¶

# test_validators.py
from my_template import MonetaryAmount
import pytest

def test_positive_amount():
    """Test that negative amounts are rejected."""
    with pytest.raises(ValueError, match="non-negative"):
        MonetaryAmount(value=-100, currency="EUR")

def test_valid_amount():
    """Test that positive amounts are accepted."""
    amount = MonetaryAmount(value=100, currency="EUR")
    assert amount.value == 100

Test with uv¶

uv run pytest test_validators.py -v

Next Steps¶

Now that you understand validation:

Advanced Patterns → - Complex validation patterns
Best Practices - Complete template checklist
Examples - See validators in action