
Entities vs Components

Overview

The most critical decision when designing a Pydantic template is classifying each model as either an Entity or a Component. This distinction fundamentally affects how your knowledge graph is constructed and how nodes are deduplicated.

In this guide:

- Understanding the Entity vs Component distinction
- When to use each classification
- How to configure graph_id_fields and is_entity
- Real-world examples and decision trees


The Critical Distinction

Quick Comparison

| Aspect | Entity | Component |
|---|---|---|
| Purpose | Unique, identifiable objects | Value objects, content-based deduplication |
| Configuration | graph_id_fields=[...] | is_entity=False |
| Deduplication | By specified ID fields | By all field values |
| When to Use | Track individually (people, documents, organizations) | Shared values (addresses, amounts, measurements) |
| Graph Behavior | One node per unique ID combination | One node per unique content combination |

Visual Example

# Entity: Person (unique by name + DOB)
Person(first_name="John", last_name="Doe", dob="1990-01-01")
Person(first_name="John", last_name="Doe", dob="1990-01-01")
→ Creates 1 node (same ID fields)

Person(first_name="John", last_name="Doe", dob="1991-01-01")
→ Creates 2nd node (different DOB)

# Component: Address (unique by content)
Address(street="123 Main St", city="Paris")
Address(street="123 Main St", city="Paris")
→ Creates 1 node (identical content)

Address(street="123 Main St", city="London")
→ Creates 2nd node (different city)
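The two rules above can be mimicked with plain dataclasses. The sketch below is only an illustration of the deduplication idea: `node_key` is a hypothetical helper, not part of docling-graph, and it assumes a node's identity is the model name plus either the chosen ID fields (entity) or every field (component).

```python
from dataclasses import dataclass, fields
from typing import Optional, Sequence


@dataclass(frozen=True)
class Person:
    first_name: str
    last_name: str
    date_of_birth: str
    email: Optional[str] = None


@dataclass(frozen=True)
class Address:
    street: str
    city: str


def node_key(obj, id_fields: Optional[Sequence[str]] = None) -> tuple:
    """Hypothetical helper: build a deduplication key for one instance.

    With id_fields, only those fields count (entity behaviour).
    Without id_fields, every field counts (component behaviour).
    """
    names = list(id_fields) if id_fields else [f.name for f in fields(obj)]
    return (type(obj).__name__, *(getattr(obj, name) for name in names))


PERSON_ID = ["first_name", "last_name", "date_of_birth"]

people = [
    Person("John", "Doe", "1990-01-01", email="john@email.com"),
    Person("John", "Doe", "1990-01-01", email="john@work.com"),
    Person("John", "Doe", "1991-01-01"),
]
addresses = [
    Address("123 Main St", "Paris"),
    Address("123 Main St", "Paris"),
    Address("123 Main St", "London"),
]

# A set keeps one entry per distinct key, which stands in for "one node per key"
person_nodes = {node_key(p, PERSON_ID) for p in people}
address_nodes = {node_key(a) for a in addresses}

print(len(person_nodes))   # 2: the email difference does not create a node
print(len(address_nodes))  # 2: only the London address differs in content
```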

Entities: Unique, Identifiable Objects

What is an Entity?

An Entity is a model that represents a unique, identifiable object that should be tracked individually in your knowledge graph. Entities are deduplicated based on specific identifying fields.

Configuration

from datetime import date
from typing import Optional

from pydantic import BaseModel, ConfigDict, Field


class Person(BaseModel):
    """
    A person entity.
    Uniquely identified by first name, last name, and date of birth.
    """
    model_config = ConfigDict(graph_id_fields=["first_name", "last_name", "date_of_birth"])

    first_name: Optional[str] = Field(...)
    last_name: Optional[str] = Field(...)
    date_of_birth: Optional[date] = Field(...)
    email: Optional[str] = Field(None)  # Not part of ID
    phone: Optional[str] = Field(None)  # Not part of ID

Key Points:

- Use graph_id_fields to specify which fields form the unique identifier
- Only fields in graph_id_fields are used for deduplication
- Other fields can vary without creating new nodes

When to Use Entities

Use entities for models that represent:

✅ People - Individuals with unique identities

model_config = ConfigDict(graph_id_fields=["first_name", "last_name", "date_of_birth"])

✅ Organizations - Companies, institutions

model_config = ConfigDict(graph_id_fields=["name"])

✅ Documents - Invoices, contracts, reports

model_config = ConfigDict(graph_id_fields=["document_number"])

✅ Products - Items with SKUs or unique identifiers

model_config = ConfigDict(graph_id_fields=["sku"])

✅ Experiments - Research experiments with IDs

model_config = ConfigDict(graph_id_fields=["experiment_id"])

Choosing graph_id_fields

Select fields that:

  1. Together form a natural unique identifier
  2. Are stable (don't change frequently)
  3. Are likely to be present in extracted data
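One way to check the third criterion is to measure how often a candidate ID is fully populated in a sample of extracted records. The sketch below is a standalone illustration: `id_field_coverage` and the sample records are hypothetical and not part of docling-graph.

```python
from typing import Iterable, Mapping, Sequence


def id_field_coverage(records: Iterable[Mapping[str, object]],
                      id_fields: Sequence[str]) -> float:
    """Return the share of records in which every ID field is non-empty."""
    records = list(records)
    if not records:
        return 0.0
    complete = sum(
        1 for record in records
        if all(record.get(name) not in (None, "") for name in id_fields)
    )
    return complete / len(records)


# Hypothetical extraction results for a Person model
samples = [
    {"first_name": "Jean", "last_name": "Dupont", "date_of_birth": "1985-03-12"},
    {"first_name": "Maria", "last_name": "Garcia", "date_of_birth": None},
    {"first_name": "John", "last_name": "Smith", "date_of_birth": "1990-06-20"},
    {"first_name": "Ana", "last_name": "", "date_of_birth": "1979-11-02"},
]

full_id = id_field_coverage(samples, ["first_name", "last_name", "date_of_birth"])
name_only = id_field_coverage(samples, ["first_name", "last_name"])
print(full_id, name_only)  # 0.5 0.75
```

A low coverage value suggests the candidate ID will often be incomplete in practice, which is worth knowing before committing to it.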

Examples

# Single field ID
class Organization(BaseModel):
    model_config = ConfigDict(graph_id_fields=["name"])
    name: str = Field(...)

# Multi-field ID
class Person(BaseModel):
    model_config = ConfigDict(
        graph_id_fields=["first_name", "last_name", "date_of_birth"]
    )
    first_name: Optional[str] = Field(...)
    last_name: Optional[str] = Field(...)
    date_of_birth: Optional[date] = Field(...)

# Complex ID
class Measurement(BaseModel):
    model_config = ConfigDict(
        graph_id_fields=["name", "text_value", "numeric_value", "unit"]
    )
    name: str = Field(...)
    text_value: Optional[str] = Field(None)
    numeric_value: Optional[float] = Field(None)
    unit: Optional[str] = Field(None)

Entity Examples

Person Entity

class Person(BaseModel):
    """
    Person entity.
    Uniquely identified by name and date of birth.
    """
    model_config = ConfigDict(
        graph_id_fields=["first_name", "last_name", "date_of_birth"]
    )

    first_name: Optional[str] = Field(
        None,
        description="Person's given name",
        examples=["Jean", "Maria", "John"]
    )

    last_name: Optional[str] = Field(
        None,
        description="Person's family name",
        examples=["Dupont", "Garcia", "Smith"]
    )

    date_of_birth: Optional[date] = Field(
        None,
        description="Date of birth in YYYY-MM-DD format",
        examples=["1985-03-12", "1990-06-20"]
    )

    # These fields don't affect identity
    email: Optional[str] = Field(None)
    phone: Optional[str] = Field(None)

    def __str__(self) -> str:
        parts = [self.first_name, self.last_name]
        return " ".join(p for p in parts if p) or "Unknown"

Graph Behavior:

Person(first_name="John", last_name="Doe", date_of_birth="1990-01-01", email="john@email.com")
Person(first_name="John", last_name="Doe", date_of_birth="1990-01-01", email="john@work.com")
→ Same node (same ID fields, email difference ignored)

Document Entity

class BillingDocument(BaseModel):
    """
    Billing document entity.
    Uniquely identified by invoice number.
    """
    model_config = ConfigDict(graph_id_fields=["document_no"])

    document_no: str = Field(
        ...,
        description="Unique invoice identifier",
        examples=["INV-2024-001", "12345"]
    )

    date: Optional[str] = Field(None)
    total: Optional[float] = Field(None)

    def __str__(self) -> str:
        return f"Invoice {self.document_no}"

Components: Value Objects

What is a Component?

A Component is a model that represents a value object - it's deduplicated by its entire content. If two components have identical field values, they share the same graph node.

Configuration

class Address(BaseModel):
    """
    Physical address component.
    Deduplicated by content - identical addresses share the same node.
    """
    model_config = ConfigDict(is_entity=False)

    street_address: Optional[str] = Field(...)
    city: Optional[str] = Field(...)
    postal_code: Optional[str] = Field(...)
    country: Optional[str] = Field(...)

Key Points:

- Use is_entity=False to mark as component
- All fields are used for deduplication
- Identical content = same node
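Content-based deduplication can be pictured as hashing every field value into a single identifier. The sketch below assumes such a scheme purely for illustration; `content_id` is a hypothetical helper and does not describe how docling-graph computes node IDs internally.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Address:
    street_address: Optional[str] = None
    city: Optional[str] = None
    postal_code: Optional[str] = None
    country: Optional[str] = None


def content_id(obj) -> str:
    """Hypothetical helper: hash the model name plus all field values."""
    payload = {"model": type(obj).__name__, "fields": asdict(obj)}
    # sort_keys makes the serialization independent of field order
    encoded = json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:16]


a = Address("123 Main St", "Paris", "75001", "FR")
b = Address("123 Main St", "Paris", "75001", "FR")
c = Address("123 Main St", "Paris", "75002", "FR")

print(content_id(a) == content_id(b))  # True: identical content
print(content_id(a) == content_id(c))  # False: postal code differs
```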

When to Use Components

Use components for models that represent:

✅ Addresses - Physical locations

model_config = ConfigDict(is_entity=False)

✅ Monetary Amounts - Values with currency

model_config = ConfigDict(is_entity=False)

✅ Measurements - Quantities with units

model_config = ConfigDict(is_entity=False)

✅ Dates/Times - Temporal values

model_config = ConfigDict(is_entity=False)

✅ Coordinates - Geographic points

model_config = ConfigDict(is_entity=False)

Component Examples

Address Component

class Address(BaseModel):
    """
    Physical address component.
    Deduplicated by content - identical addresses share the same node.
    """
    model_config = ConfigDict(is_entity=False)

    street_address: Optional[str] = Field(
        None,
        description="Street name and number",
        examples=["123 Main Street", "45 Avenue des Champs-Élysées"]
    )

    city: Optional[str] = Field(
        None,
        description="City name",
        examples=["Paris", "London", "New York"]
    )

    postal_code: Optional[str] = Field(
        None,
        description="Postal or ZIP code",
        examples=["75001", "SW1A 1AA", "10001"]
    )

    country: Optional[str] = Field(
        None,
        description="Country name or code",
        examples=["France", "FR", "United Kingdom"]
    )

    def __str__(self) -> str:
        parts = [self.street_address, self.city, self.postal_code, self.country]
        return ", ".join(p for p in parts if p)

Graph Behavior:

Address(street_address="123 Main St", city="Paris", postal_code="75001")
Address(street_address="123 Main St", city="Paris", postal_code="75001")
→ Same node (identical content)

Address(street_address="123 Main St", city="Paris", postal_code="75002")
→ Different node (postal code differs)

Monetary Amount Component

class MonetaryAmount(BaseModel):
    """
    Monetary value with currency.
    Deduplicated by content - same value and currency share a node.
    """
    model_config = ConfigDict(is_entity=False)

    value: float = Field(
        ...,
        description="Numeric amount",
        examples=[500.00, 1250.50, 89.99]
    )

    currency: Optional[str] = Field(
        None,
        description="ISO 4217 currency code",
        examples=["EUR", "USD", "GBP"]
    )

    @field_validator("value")
    @classmethod
    def validate_non_negative(cls, v: float) -> float:
        # Zero is allowed; only negative amounts are rejected
        if v < 0:
            raise ValueError("Amount must be non-negative")
        return v

    def __str__(self) -> str:
        return f"{self.value} {self.currency or ''}".strip()

Graph Behavior:

MonetaryAmount(value=100.00, currency="EUR")
MonetaryAmount(value=100.00, currency="EUR")
→ Same node (identical value and currency)

MonetaryAmount(value=100.00, currency="USD")
→ Different node (different currency)


Decision Tree

Use this decision tree to classify your models:

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TD
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "New Model" }

    B{"Should this be<br/>tracked individually?"}
    C{"Does it have a<br/>natural unique ID?"}
    F{"Can you create<br/>a composite ID?"}
    G{"Is it a value<br/>that's shared?"}

    %% Outcomes
    D@{ shape: tag-proc, label: "Component<br/>is_entity=False" }
    E@{ shape: procs, label: "Entity<br/>graph_id_fields" }
    H@{ shape: lin-proc, label: "Consider redesigning<br/>or use content-based ID" }

    %% 3. Define Connections
    A --> B
    B -- Yes --> C
    B -- No --> D

    C -- Yes --> E
    C -- No --> F

    F -- Yes --> E
    F -- No --> G

    G -- Yes --> D
    G -- No --> H

    %% 4. Apply Classes
    class A input
    class B,C,F,G decision
    class E output
    class D data
    class H config
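The flowchart can also be read as a small function. The sketch below simply encodes the branches of the diagram; `classify_model` and its parameter names are hypothetical and exist only to make the logic testable.

```python
def classify_model(track_individually: bool,
                   has_natural_id: bool = False,
                   can_build_composite_id: bool = False,
                   is_shared_value: bool = False) -> str:
    """Walk the decision tree and return 'entity', 'component', or 'redesign'."""
    if not track_individually:
        return "component"        # is_entity=False
    if has_natural_id or can_build_composite_id:
        return "entity"           # graph_id_fields=[...]
    if is_shared_value:
        return "component"
    return "redesign"             # reconsider the model or use a content-based ID


print(classify_model(True, has_natural_id=True))          # entity (e.g. an invoice number)
print(classify_model(True, can_build_composite_id=True))  # entity (e.g. name + date of birth)
print(classify_model(False))                              # component (e.g. an address)
print(classify_model(True))                               # redesign
```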

Questions to Ask

  1. "Should this be tracked individually?"
     - Yes → Likely an Entity
     - No → Likely a Component

  2. "If I see this twice with identical values, should it be one thing or two?"
     - One thing → Component
     - Two things → Entity

  3. "Does this represent a unique object or a shared value?"
     - Unique object → Entity
     - Shared value → Component

  4. "Would I want to query for all instances of this specific thing?"
     - Yes → Entity
     - No → Component

Real-World Examples

Invoice Processing

# ENTITY: BillingDocument (unique document)
class BillingDocument(BaseModel):
    model_config = ConfigDict(graph_id_fields=["document_no"])
    document_no: str = Field(...)
    # Each invoice is unique

# ENTITY: Organization (unique company)
class Organization(BaseModel):
    model_config = ConfigDict(graph_id_fields=["name"])
    name: str = Field(...)
    # Each organization is unique

# COMPONENT: Address (shared value)
class Address(BaseModel):
    model_config = ConfigDict(is_entity=False)
    street: str = Field(...)
    city: str = Field(...)
    # Multiple organizations can share the same address

# COMPONENT: MonetaryAmount (shared value)
class MonetaryAmount(BaseModel):
    model_config = ConfigDict(is_entity=False)
    value: float = Field(...)
    currency: str = Field(...)
    # Multiple invoices can have the same amount

Graph Structure:

BillingDocument-001 --ISSUED_BY--> Acme Corp --LOCATED_AT--> Address(123 Main St, Paris)
BillingDocument-002 --ISSUED_BY--> Tech Ltd --LOCATED_AT--> Address(123 Main St, Paris)
                                                      ↑ Same address node shared
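The shared address node in this structure follows directly from the deduplication keys. The sketch below rebuilds the same five nodes and four edges with plain tuples and sets; it is an illustration only, with `add_edge` as a hypothetical helper, and does not use the docling-graph API.

```python
from typing import Set, Tuple

Node = Tuple[str, ...]         # (label, *identity values)
Edge = Tuple[Node, str, Node]  # (source, relation, target)

nodes: Set[Node] = set()
edges: Set[Edge] = set()


def add_edge(source: Node, relation: str, target: Node) -> None:
    """Register both endpoints and the edge; sets drop repeated keys."""
    nodes.update((source, target))
    edges.add((source, relation, target))


records = [
    ("001", "Acme Corp"),
    ("002", "Tech Ltd"),
]
for document_no, org_name in records:
    invoice = ("BillingDocument", document_no)  # entity: keyed by document_no
    organization = ("Organization", org_name)   # entity: keyed by name
    # Component: keyed by all content, so both loops produce an equal key
    address = ("Address", "123 Main St", "Paris")
    add_edge(invoice, "ISSUED_BY", organization)
    add_edge(organization, "LOCATED_AT", address)

address_nodes = [node for node in nodes if node[0] == "Address"]
print(len(nodes), len(edges), len(address_nodes))  # 5 4 1
```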

Rheology Research

# ENTITY: Research (unique paper)
class Research(BaseModel):
    model_config = ConfigDict(graph_id_fields=["title"])
    title: str = Field(...)

# ENTITY: Experiment (unique experiment)
class Experiment(BaseModel):
    model_config = ConfigDict(graph_id_fields=["experiment_id"])
    experiment_id: str = Field(...)

# ENTITY: Material (unique material type)
class Material(BaseModel):
    model_config = ConfigDict(graph_id_fields=["material_type"])
    material_type: str = Field(...)

# COMPONENT: Measurement (shared value)
class Measurement(BaseModel):
    model_config = ConfigDict(is_entity=False)
    name: str = Field(...)
    value: float = Field(...)
    unit: str = Field(...)
    # Multiple experiments can have the same measurement

ID Card

# ENTITY: IDCard (unique document)
class IDCard(BaseModel):
    model_config = ConfigDict(graph_id_fields=["document_number"])
    document_number: str = Field(...)

# ENTITY: Person (unique individual)
class Person(BaseModel):
    model_config = ConfigDict(
        graph_id_fields=["given_names", "last_name", "date_of_birth"]
    )
    given_names: List[str] = Field(...)
    last_name: str = Field(...)
    date_of_birth: date = Field(...)

# COMPONENT: Address (shared value)
class Address(BaseModel):
    model_config = ConfigDict(is_entity=False)
    street_address: str = Field(...)
    city: str = Field(...)
    # Multiple people can live at the same address

Common Patterns

Pattern 1: Shared Addresses

Scenario: Multiple people or organizations at the same address.

Solution: Make Address a component.

class Address(BaseModel):
    """Component - shared by multiple entities."""
    model_config = ConfigDict(is_entity=False)
    # ...

class Person(BaseModel):
    """Entity - unique individual."""
    model_config = ConfigDict(graph_id_fields=["first_name", "last_name"])
    # ...
    addresses: List[Address] = edge(label="LIVES_AT", default_factory=list)

class Organization(BaseModel):
    """Entity - unique company."""
    model_config = ConfigDict(graph_id_fields=["name"])
    # ...
    addresses: List[Address] = edge(label="LOCATED_AT", default_factory=list)

Result: Same address node is shared across multiple people/organizations.

Pattern 2: Measurements in Research

Scenario: Multiple experiments report the same measurement value.

Solution: Make Measurement a component.

class Measurement(BaseModel):
    """Component - shared measurement value."""
    model_config = ConfigDict(is_entity=False)
    name: str = Field(...)
    value: float = Field(...)
    unit: str = Field(...)

class Experiment(BaseModel):
    """Entity - unique experiment."""
    model_config = ConfigDict(graph_id_fields=["experiment_id"])
    # ...
    measurements: List[Measurement] = Field(default_factory=list)

Pattern 3: Line Items

Scenario: Invoice line items - should each be unique or shared?

Decision: Usually neither - line items are typically embedded data, not separate nodes.

class LineItem(BaseModel):
    """Line item - embedded in invoice, not a separate node."""
    # No model_config needed - this won't become a node
    description: str = Field(...)
    quantity: float = Field(...)
    unit_price: float = Field(...)

class BillingDocument(BaseModel):
    """Entity - unique invoice."""
    model_config = ConfigDict(graph_id_fields=["document_no"])
    # ...
    # Use regular Field, not edge() - these are embedded
    items: List[LineItem] = Field(default_factory=list)

Line items as nodes

If you want line items as nodes, use edge() and decide if they're entities or components.


Common Mistakes

❌ Making Everything an Entity

# WRONG - Address as entity
class Address(BaseModel):
    model_config = ConfigDict(graph_id_fields=["street", "city"])
    # Only street and city are compared; every other field is ignored

Problem: Addresses are deduplicated on street and city alone, so two addresses that differ in any other field (for example postal code) collapse into one node.

Fix: Make Address a component.

❌ Making Everything a Component

# WRONG - Person as component
class Person(BaseModel):
    model_config = ConfigDict(is_entity=False)
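    # This merges people whose extracted fields are all identical

Problem: Identity now depends on every field, so two different people with the same extracted values become one node, and the same person extracted with a different email becomes two.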

Fix: Make Person an entity with appropriate graph_id_fields.

❌ Wrong ID Fields

# WRONG - Using non-stable fields
class Person(BaseModel):
    model_config = ConfigDict(graph_id_fields=["email"])
    # Email can change, creating duplicate nodes

Problem: When email changes, a new node is created for the same person.

Fix: Use stable fields like name + date of birth.
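The effect of an unstable ID field is easy to reproduce with two observations of the same person. The sketch below is a plain-Python illustration; `count_nodes` is a hypothetical helper that counts distinct ID tuples and is not part of docling-graph.

```python
from typing import Mapping, Sequence


def count_nodes(records: Sequence[Mapping[str, str]], id_fields: Sequence[str]) -> int:
    """Count distinct ID tuples, i.e. how many entity nodes would result."""
    return len({tuple(record[name] for name in id_fields) for record in records})


# The same person, extracted from two documents with different email addresses
observations = [
    {"first_name": "John", "last_name": "Doe",
     "date_of_birth": "1990-01-01", "email": "john@email.com"},
    {"first_name": "John", "last_name": "Doe",
     "date_of_birth": "1990-01-01", "email": "john@work.com"},
]

by_email = count_nodes(observations, ["email"])
by_name_dob = count_nodes(observations, ["first_name", "last_name", "date_of_birth"])
print(by_email, by_name_dob)  # 2 1
```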


Testing Your Classification

Test 1: Deduplication Behavior

# Test entity deduplication
person1 = Person(first_name="John", last_name="Doe", email="john@email.com")
person2 = Person(first_name="John", last_name="Doe", email="john@work.com")
# Should create 1 node (same ID fields)

# Test component deduplication
addr1 = Address(street="123 Main St", city="Paris")
addr2 = Address(street="123 Main St", city="Paris")
# Should create 1 node (identical content)

addr3 = Address(street="123 Main St", city="London")
# Should create 2nd node (different city)

Test 2: Graph Structure

Run extraction and check the graph:

uv run docling-graph convert document.pdf \
    --template "my_template.MyTemplate" \
    --export-format csv \
    --output-dir test_output

Check test_output/nodes.csv:

- Entities should have one row per unique ID combination
- Components should have one row per unique content combination
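The row check can be scripted with the standard csv module. The sketch below is an assumption-heavy illustration: `duplicate_keys` is a hypothetical helper, and the column names (`label`, `first_name`, `last_name`) are invented for the demo, so substitute whatever headers the exported nodes.csv actually contains.

```python
import csv
import os
import tempfile
from collections import Counter
from typing import Dict, Sequence, Tuple


def duplicate_keys(csv_path: str, key_columns: Sequence[str]) -> Dict[Tuple[str, ...], int]:
    """Return every key-column combination that occurs in more than one row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        counts = Counter(
            tuple(row.get(column) or "" for column in key_columns)
            for row in csv.DictReader(handle)
        )
    return {key: n for key, n in counts.items() if n > 1}


# Demo on a throwaway file with hypothetical columns
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nodes.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["label", "first_name", "last_name"])
        writer.writerow(["Person", "John", "Doe"])
        writer.writerow(["Person", "John", "Doe"])
        writer.writerow(["Person", "Maria", "Garcia"])
    duplicates = duplicate_keys(path, ["label", "first_name", "last_name"])

print(duplicates)  # {('Person', 'John', 'Doe'): 2}
```

An empty result for the ID columns of an entity, or for all content columns of a component, is what a correct classification should produce.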


Next Steps

Now that you understand entities vs components:

  1. Field Definitions - Learn to write effective field descriptions
  2. Relationships - Connect entities and components with edges
  3. Advanced Patterns - Complex entity/component patterns