Core Concepts#
The DQ Validator module is built around a small set of concepts that work together: metadata, the validator, validation rules, validation checks, and validation results.
Note
This section provides an overview of core concepts. Detailed API documentation is available in the API Reference.
Metadata#
Metadata defines the structure of your data and is the foundation of validation.
AssetMetadata#
Represents a data asset (table) with its columns:
from wxdi.dq_validator import AssetMetadata, ColumnMetadata, DataType

metadata = AssetMetadata(
    table_name="customers",
    columns=[
        ColumnMetadata(name="customer_id", data_type=DataType.INTEGER, position=0),
        ColumnMetadata(name="email", data_type=DataType.STRING, position=1),
        ColumnMetadata(name="age", data_type=DataType.INTEGER, position=2)
    ]
)
ColumnMetadata#
Defines individual column properties:
name: Column name
data_type: Expected data type
position: Position in the record array (0-based)
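To make the role of position concrete, here is a minimal sketch of how a 0-based position can map a column name to a value in a record array. ColumnSketch and value_for are hypothetical stand-ins written for illustration; they are not part of wxdi.dq_validator, and the record shape (a plain list ordered by position) is an assumption.

```python
from dataclasses import dataclass


# Hypothetical stand-in for illustration only; not the wxdi ColumnMetadata class.
@dataclass(frozen=True)
class ColumnSketch:
    name: str
    position: int  # 0-based index into the record array


def value_for(columns, record, column_name):
    """Return the value in `record` that belongs to `column_name`."""
    for column in columns:
        if column.name == column_name:
            return record[column.position]
    raise KeyError(column_name)


columns = [
    ColumnSketch("customer_id", 0),
    ColumnSketch("email", 1),
    ColumnSketch("age", 2),
]

# Assumed record shape: a list of values ordered by column position.
record = [101, "ada@example.com", 36]

email = value_for(columns, record, "email")
age = value_for(columns, record, "age")
```

Because lookups go through position, the order of values in each record has to match the positions declared in the metadata.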
Validator#
The main validation engine that applies rules to records.
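from wxdi.dq_validator import Validator

# rule1 and rule2 are ValidationRule instances (see ValidationRule below);
# record is a single row of values ordered by column position.
validator = Validator(metadata)
validator.add_rule(rule1)
validator.add_rule(rule2)
result = validator.validate(record)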
ValidationRule#
Defines validation logic for a specific column.
from wxdi.dq_validator import ValidationRule
from wxdi.dq_validator.checks import CompletenessCheck, LengthCheck

# Create a rule for the "email" column and attach two checks to it.
rule = ValidationRule("email")
rule.add_check(CompletenessCheck())
rule.add_check(LengthCheck(min_length=5, max_length=100))
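The relationship between a rule and its checks can be sketched with plain Python classes. Everything below (RuleSketch, CompletenessSketch, LengthSketch, and the run and evaluate methods) is a hypothetical stand-in that only mirrors the shape of the example above; the real classes in wxdi.dq_validator may behave differently.

```python
# Hypothetical stand-ins for illustration; not the wxdi implementation.
class CompletenessSketch:
    """Passes when a value is present (not None and not an empty string)."""

    def run(self, value):
        return value is not None and value != ""


class LengthSketch:
    """Passes when a string's length falls within [min_length, max_length]."""

    def __init__(self, min_length, max_length):
        self.min_length = min_length
        self.max_length = max_length

    def run(self, value):
        return (
            isinstance(value, str)
            and self.min_length <= len(value) <= self.max_length
        )


class RuleSketch:
    """Holds the checks for one column and runs them in insertion order."""

    def __init__(self, column_name):
        self.column_name = column_name
        self.checks = []

    def add_check(self, check):
        self.checks.append(check)

    def evaluate(self, value):
        """Return the class names of the checks that failed for `value`."""
        return [type(c).__name__ for c in self.checks if not c.run(value)]


rule = RuleSketch("email")
rule.add_check(CompletenessSketch())
rule.add_check(LengthSketch(min_length=5, max_length=100))

failures_ok = rule.evaluate("ada@example.com")  # no check fails
failures_short = rule.evaluate("a@b")           # too short for the length check
failures_missing = rule.evaluate(None)          # fails both checks
```

Keeping each check as a small object with a single method lets one rule combine any number of them without the rule needing to know what each check tests.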
Validation Checks#
Individual validation checks that can be added to rules. See Validation Checks for details.
ValidationResult#
Contains the results of validating a single record:
result = validator.validate(record)
print(f"Score: {result.validation_score}")
print(f"Pass Rate: {result.pass_rate}%")
print(f"Errors: {len(result.errors)}")
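How the library computes these values is not covered in this overview. As a rough sketch only, assuming the pass rate is the percentage of checks that passed for a record, the calculation could look like the following. The function pass_rate_percent and the list-of-booleans input are hypothetical and are not part of wxdi.dq_validator.

```python
def pass_rate_percent(outcomes):
    """Percentage of checks that passed.

    `outcomes` is a list of booleans, one per check (an assumed shape).
    """
    if not outcomes:
        # Arbitrary choice for this sketch: no checks means a rate of 0.0.
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Four checks were run against one record; the third one failed.
outcomes = [True, True, False, True]

rate = pass_rate_percent(outcomes)
failed_indexes = [i for i, ok in enumerate(outcomes) if not ok]
```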
Data Quality Dimensions#
Validation checks are grouped into eight standard data quality dimensions:
Accuracy: Data correctly represents real-world values
Completeness: Required data is present
Conformity: Data conforms to specified formats
Consistency: Data is consistent across systems
Coverage: Data covers the required scope
Timeliness: Data is up-to-date
Uniqueness: Data has no duplicates where required
Validity: Data is valid according to business rules
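One way to picture this categorization is an enumeration of the eight dimensions plus a mapping from checks to dimensions, which makes it possible to summarize failures per dimension. The sketch below is illustrative only: DimensionSketch, the check names, and their dimension assignments are hypothetical and are not taken from wxdi.dq_validator.

```python
from collections import Counter
from enum import Enum


# Hypothetical enumeration of the eight dimensions listed above.
class DimensionSketch(Enum):
    ACCURACY = "Accuracy"
    COMPLETENESS = "Completeness"
    CONFORMITY = "Conformity"
    CONSISTENCY = "Consistency"
    COVERAGE = "Coverage"
    TIMELINESS = "Timeliness"
    UNIQUENESS = "Uniqueness"
    VALIDITY = "Validity"


# Made-up check names and assignments, for illustration only.
check_dimensions = {
    "not_null": DimensionSketch.COMPLETENESS,
    "email_format": DimensionSketch.CONFORMITY,
    "no_duplicates": DimensionSketch.UNIQUENESS,
}

# Names of checks that failed while validating some records.
failed_checks = ["not_null", "email_format", "not_null"]

# Count failures per dimension to see where quality problems cluster.
failures_by_dimension = Counter(check_dimensions[name] for name in failed_checks)
```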
For more information, see the API Reference.