Features#
The IBM watsonx.data intelligence SDK provides a comprehensive set of features organized into modular components.
Common Modules#
Authentication#
Unified authentication framework supporting multiple environments:
IBM Cloud: IAM authentication with API keys
AWS Cloud: AWS-specific authentication protocols
Government Cloud: Specialized authentication for government deployments
On-Premises: Username/password and Zen API key authentication
Key features:
Automatic token management and refresh
Thread-safe session handling
SSL verification control for on-premises deployments
Type-safe configuration with full validation
See Authentication for detailed usage.
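The automatic token refresh and thread-safe session handling listed above can be pictured with a small sketch. This is not the SDK's implementation: `TokenCache`, its `fetch` callback, and the 60-second refresh margin are hypothetical names and values chosen only to illustrate the idea of refreshing a token shortly before it expires while holding a lock.

```python
import threading
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Hypothetical thread-safe token cache (not the SDK's own class)."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],
        skew_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # assumed to return (token, lifetime in seconds)
        self._skew = skew_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        # The lock ensures only one thread refreshes the token at a time.
        with self._lock:
            now = self._clock()
            if self._token is None or now >= self._expires_at - self._skew:
                self._token, lifetime = self._fetch()
                self._expires_at = now + lifetime
            return self._token


# Demo with a fake token endpoint and a controllable clock.
calls = []


def fake_fetch() -> Tuple[str, float]:
    calls.append(1)
    return f"token-{len(calls)}", 3600.0


fake_now = [0.0]
cache = TokenCache(fake_fetch, clock=lambda: fake_now[0])
first = cache.get()    # fetched
second = cache.get()   # served from the cache
fake_now[0] = 3570.0   # inside the 60-second refresh margin
third = cache.get()    # refreshed early
```

Refreshing slightly before expiry avoids sending a request with a token that expires in flight.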
DQ Validator Module#
The Data Quality Validator module provides comprehensive in-memory validation capabilities for streaming data and DataFrames.
Core Validation Engine#
Array-based Records: Optimized for streaming data where records are arrays of values
Metadata-driven: Define table structure and column mappings once, reuse across validations
Fluent API: Chainable method calls for intuitive rule definition
Score-based Results: Each validation returns detailed scores, pass rates, and error details
Type Safety: Full type hints throughout for better IDE support
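To make the combination of array-based records, metadata-driven column mapping, a fluent API, and score-based results concrete, here is a minimal sketch. `RecordValidator` and its methods `require`, `max_length`, and `validate` are hypothetical and do not reflect the SDK's actual class or method names; the result dictionary layout is likewise an assumption.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


class RecordValidator:
    """Hypothetical validator for records given as arrays of values."""

    def __init__(self, columns: Sequence[str]) -> None:
        # Metadata is defined once: column name -> position in the record.
        self._index = {name: i for i, name in enumerate(columns)}
        self._rules: List[Tuple[str, str, Callable[[Any], bool]]] = []

    def require(self, column: str) -> "RecordValidator":
        self._rules.append(
            (column, "completeness", lambda v: v is not None and v != "")
        )
        return self  # returning self makes the calls chainable

    def max_length(self, column: str, limit: int) -> "RecordValidator":
        self._rules.append(
            (column, "length", lambda v: v is None or len(str(v)) <= limit)
        )
        return self

    def validate(self, record: Sequence[Any]) -> Dict[str, Any]:
        errors = [
            {"column": column, "check": check}
            for column, check, predicate in self._rules
            if not predicate(record[self._index[column]])
        ]
        total = len(self._rules)
        passed = total - len(errors)
        score = passed / total if total else 1.0
        return {"score": score, "passed": passed, "errors": errors}


validator = (
    RecordValidator(["id", "name", "country"])
    .require("id")
    .require("name")
    .max_length("country", 2)
)
bad = validator.validate([1, "", "USA"])     # fails two of three rules
good = validator.validate([2, "Ada", "US"])  # passes all rules
```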
Data Quality Dimensions#
Track validation checks by eight standard data quality dimensions:
Accuracy: Data correctly represents real-world values
Completeness: Required data is present
Conformity: Data conforms to specified formats
Consistency: Data is consistent across systems
Coverage: Data covers the required scope
Timeliness: Data is up-to-date
Uniqueness: Data has no duplicates where required
Validity: Data is valid according to business rules
Validation Checks#
Nine comprehensive validation check types:
LengthCheck: Validates string length (min, max, exact)
ValidValuesCheck: Validates values against a list of allowed values, with an optional case-insensitive match
ComparisonCheck: Compares values using operators (==, !=, <, >, <=, >=)
CaseCheck: Validates character case (upper, lower, name, sentence)
CompletenessCheck: Validates presence (non-null) of values
RangeCheck: Validates that numeric values fall within a min/max range
RegexCheck: Validates that values match a regular expression pattern
FormatCheck: Validates value formats using intelligent format detection
DataTypeCheck: Validates data types with intelligent type inference
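As an illustration of how an operator-driven check such as ComparisonCheck could work, the sketch below maps the six operator strings listed above onto functions from Python's standard `operator` module. The `compare` function and its treatment of `None` as a failure are assumptions made for this example, not the SDK's documented behavior.

```python
import operator
from typing import Any

# The six operator strings listed for ComparisonCheck.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


def compare(value: Any, op: str, target: Any) -> bool:
    """Hypothetical comparison check; None is treated as a failure (assumption)."""
    if op not in OPERATORS:
        raise ValueError(f"unsupported operator: {op!r}")
    if value is None:
        return False
    return OPERATORS[op](value, target)


at_least_five = compare(5, ">=", 5)
greater_than_five = compare(3, ">", 5)
missing_value = compare(None, "==", None)
```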
DataFrame Integration#
Pandas Support:
Memory-efficient chunked processing for large DataFrames
Configurable chunk sizes for optimal performance
Single validation result column containing all metrics
Handles DataFrames from thousands to millions of rows
PySpark Support:
Distributed validation using Spark UDFs
Scalable to billions of rows
Consistent API with Pandas integration
Struct column output with all validation metrics
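The idea behind chunked processing is to validate a bounded slice of rows at a time so that intermediate results never cover the whole dataset at once. The sketch below shows the pattern on plain Python lists to stay dependency-free; `iter_chunks` and `validate_in_chunks` are hypothetical helpers, and with a Pandas DataFrame the equivalent slice would be `df.iloc[start:start + chunk_size]`.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence

Row = Sequence[Any]


def iter_chunks(rows: Sequence[Row], chunk_size: int) -> Iterator[Sequence[Row]]:
    """Yield consecutive slices of at most chunk_size rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def validate_in_chunks(
    rows: Sequence[Row],
    validate_row: Callable[[Row], Dict[str, Any]],
    chunk_size: int = 2,
) -> List[Dict[str, Any]]:
    """Validate one chunk at a time and collect one result per row."""
    results: List[Dict[str, Any]] = []
    for chunk in iter_chunks(rows, chunk_size):
        results.extend(validate_row(row) for row in chunk)
    return results


rows = [[1, "a"], [2, None], [3, "c"], [4, None], [5, "e"]]
chunk_sizes = [len(chunk) for chunk in iter_chunks(rows, 2)]
results = validate_in_chunks(rows, lambda row: {"passed": row[1] is not None})
```

Each row ends up with a single result entry, mirroring the single result column described above.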
REST API Integration#
Integration with IBM Cloud Pak for Data:
GlossaryProvider: Fetch glossary terms and data quality constraints
CamsProvider: Fetch data assets from the Catalog Asset Management System (CAMS)
AssetsProvider: Manage data assets and their metadata
DimensionsProvider: Manage data quality dimensions
ChecksProvider: Manage data quality checks
IssuesProvider: Track and manage data quality issues
DQSearchProvider: Search for DQ checks and assets by native ID
Features:
Thread-safe concurrent access
Automatic retry with exponential backoff
Comprehensive error handling
Type-safe request/response models
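Automatic retry with exponential backoff generally means waiting longer after each failed attempt. The following sketch shows the general technique only; `retry_with_backoff`, its default delays, and the choice of retryable exceptions are assumptions and do not describe the providers' actual retry policy.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Call `call` until it succeeds, waiting base_delay * factor**attempt between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * factor ** attempt)
    raise AssertionError("unreachable")


# Demo: a call that fails twice, then succeeds. Delays are recorded, not slept.
delays = []
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = retry_with_backoff(flaky, sleep=delays.append)
```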
Result Consolidation#
Aggregate and analyze validation results:
Overall statistics across all validations
Per-column statistics and error tracking
Per-check statistics and pass rates
Combined statistics with filtering
Dimension-based issue tracking
Error retrieval by column, check, or both
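The statistics listed above amount to grouping individual validation outcomes by column or by check and filtering the failures. A minimal sketch follows, assuming each outcome is a dictionary with `column`, `check`, and `passed` keys; that layout and the helper names `pass_rates` and `errors_for` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional

Result = Dict[str, Any]  # assumed keys: "column", "check", "passed"


def pass_rates(results: List[Result], key: str) -> Dict[str, float]:
    """Pass rate per distinct value of `key` ("column" or "check")."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for result in results:
        totals[result[key]] += 1
        if result["passed"]:
            passed[result[key]] += 1
    return {name: passed[name] / totals[name] for name in totals}


def errors_for(
    results: List[Result],
    column: Optional[str] = None,
    check: Optional[str] = None,
) -> List[Result]:
    """Failed results, optionally filtered by column, check, or both."""
    return [
        r for r in results
        if not r["passed"]
        and (column is None or r["column"] == column)
        and (check is None or r["check"] == check)
    ]


results = [
    {"column": "email", "check": "regex", "passed": True},
    {"column": "email", "check": "regex", "passed": False},
    {"column": "email", "check": "completeness", "passed": True},
    {"column": "age", "check": "range", "passed": False},
]
by_column = pass_rates(results, "column")
by_check = pass_rates(results, "check")
overall = sum(r["passed"] for r in results) / len(results)
email_errors = errors_for(results, column="email")
```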
Extensibility#
BaseCheck: Easy to extend with custom validation checks
Modular Architecture: Add new modules without affecting existing functionality
Plugin System: Future support for third-party extensions
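Extending a base check class typically means implementing one validation method and inheriting the scoring logic. The sketch below uses a stand-in `SketchBaseCheck` because the real `BaseCheck` interface is not shown here; the method names `is_valid` and `run` and the `PrefixCheck` subclass are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable


class SketchBaseCheck(ABC):
    """Stand-in for a base check class; the SDK's BaseCheck may differ."""

    name = "base"

    @abstractmethod
    def is_valid(self, value: Any) -> bool:
        """Return True when a single value passes the check."""

    def run(self, values: Iterable[Any]) -> Dict[str, Any]:
        # Shared logic: apply is_valid to every value and compute a pass rate.
        values = list(values)
        failures = [v for v in values if not self.is_valid(v)]
        rate = 1.0 if not values else (len(values) - len(failures)) / len(values)
        return {"check": self.name, "pass_rate": rate, "failures": failures}


class PrefixCheck(SketchBaseCheck):
    """Custom check: values must be strings starting with a given prefix."""

    name = "prefix"

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def is_valid(self, value: Any) -> bool:
        return isinstance(value, str) and value.startswith(self.prefix)


report = PrefixCheck("SKU-").run(["SKU-1", "sku-2", None, "SKU-9"])
```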
Performance#
Optimized for Speed: Efficient validation algorithms
Memory Efficient: Chunked processing for large datasets
Scalable: From single records to billions of rows
Parallel Processing: Support for distributed validation with PySpark
Type Safety#
Full type hints throughout the SDK
Pydantic models for data validation
IDE autocomplete and type checking support
Runtime type validation
DPH Services Module#
Python client library for IBM Data Product Hub API, providing programmatic access to data product management.
Container Management#
Initialize and configure data product containers
Manage delivery methods and domain structures
Service credential management
API key operations
Data Product Lifecycle#
Create, update, and delete data products
Draft management with version control
Publish drafts to releases
Retire releases when needed
Pagination support for large result sets
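The lifecycle described above (drafts are published to releases, and releases can be retired) can be modeled as a small state machine. This is an illustrative model only: the `State` names, the `ALLOWED` transition table, and the helper functions are assumptions, not part of the Data Product Hub API.

```python
from enum import Enum
from typing import Dict, Set


class State(Enum):
    DRAFT = "draft"
    RELEASED = "released"
    RETIRED = "retired"


# Assumed transitions: publish a draft, retire a release, nothing after retirement.
ALLOWED: Dict[State, Set[State]] = {
    State.DRAFT: {State.RELEASED},
    State.RELEASED: {State.RETIRED},
    State.RETIRED: set(),
}


def can_transition(current: State, target: State) -> bool:
    return target in ALLOWED[current]


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not allowed."""
    if not can_transition(current, target):
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = State.DRAFT
state = transition(state, State.RELEASED)  # publish
state = transition(state, State.RETIRED)   # retire
```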
Contract Terms#
Manage contract terms and documents
Create reusable contract templates
Attach terms and conditions to data products
Service level agreement management
Domain Organization#
Create and manage domains and subdomains
Organize data products by business area
Multi-industry domain support
Hierarchical domain structures
Asset Visualization#
Create data asset visualizations
Reinitiate visualizations with updated assets
Support for multiple assets per visualization
ODCS Generator Module#
Automated generation of Open Data Contract Standard (ODCS) v3.1.0 compliant YAML files from data catalog metadata.
Multi-Catalog Support#
Collibra Integration: Extract metadata from Collibra data catalog
Informatica CDGC: Extract metadata from Informatica Cloud Data Governance and Catalog
Extensible architecture for additional catalog sources
Metadata Extraction#
Automatic asset metadata extraction via REST APIs
Column discovery through catalog relations
Data type mapping (logical and physical)
Classification support via GraphQL (Collibra)
Tag integration at asset and column levels
Custom attribute preservation
ODCS Generation#
ODCS v3.1.0 compliant YAML output
Complete schema definition with column metadata
Data quality rules integration
Service level agreement specifications
Governance and ownership information
Data Type Mapping#
Intelligent mapping of catalog types to ODCS types
Support for logical types (string, integer, number, timestamp, boolean)
Physical type preservation with precision and scale
Custom type mapping support
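A type mapper of this kind usually parses the physical type, looks up a logical type, and keeps precision and scale alongside it. The sketch below is a simplified assumption of how that could look: the lookup table, the fallback to `string` for unknown types, the output key names, and the `map_type` function are all hypothetical rather than the generator's actual rules.

```python
import re
from typing import Any, Dict, Optional

# Illustrative lookup table; the generator's real mapping may differ.
LOGICAL_TYPES = {
    "varchar": "string", "char": "string", "text": "string",
    "int": "integer", "integer": "integer", "bigint": "integer",
    "decimal": "number", "numeric": "number", "float": "number", "double": "number",
    "timestamp": "timestamp", "datetime": "timestamp",
    "boolean": "boolean", "bool": "boolean",
}

# Matches e.g. "VARCHAR(255)", "DECIMAL(10, 2)", "boolean".
_TYPE_PATTERN = re.compile(
    r"^\s*([A-Za-z_ ]+?)\s*(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?\s*$"
)


def map_type(physical: str, overrides: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Map a physical catalog type to a logical type, keeping precision and scale."""
    match = _TYPE_PATTERN.match(physical)
    if match is None:
        raise ValueError(f"unrecognized type: {physical!r}")
    base, precision, scale = match.groups()
    table = {**LOGICAL_TYPES, **(overrides or {})}  # custom mappings win
    return {
        "physical_type": physical.strip(),
        "logical_type": table.get(base.lower(), "string"),  # fallback is an assumption
        "precision": int(precision) if precision else None,
        "scale": int(scale) if scale else None,
    }


price = map_type("DECIMAL(10, 2)")
name = map_type("VARCHAR(255)")
flag = map_type("boolean")
custom = map_type("GEOGRAPHY", overrides={"geography": "string"})
```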
Data Product Recommender Module#
Analyze database query logs to identify high-value tables and logical groupings for data product prioritization.
Multi-Platform Support#
Snowflake: Query log analysis from ACCOUNT_USAGE.QUERY_HISTORY
Databricks: Query log analysis from system.query.history
BigQuery: Query log analysis from INFORMATION_SCHEMA.JOBS_BY_PROJECT
watsonx.data: Query log analysis from system.runtime.queries
Intelligent Scoring#
Query frequency analysis (37.5% weight)
User diversity metrics (37.5% weight)
Recency scoring (15% weight)
Consistency patterns (10% weight)
Customizable scoring weights
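With the default weights above, a table's score is a weighted sum of its four metrics. The sketch below works through that arithmetic, assuming each metric has already been normalized to the range 0 to 1; the normalization, the metric key names, and the `table_score` function are assumptions for illustration.

```python
from typing import Dict, Optional

# Default weights as listed above; they sum to 1.0.
DEFAULT_WEIGHTS = {
    "frequency": 0.375,
    "user_diversity": 0.375,
    "recency": 0.15,
    "consistency": 0.10,
}


def table_score(
    metrics: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of metrics assumed to be normalized to [0, 1]."""
    weights = weights or DEFAULT_WEIGHTS
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * metrics[name] for name in weights)


metrics = {"frequency": 0.8, "user_diversity": 0.6, "recency": 1.0, "consistency": 0.5}
# 0.375*0.8 + 0.375*0.6 + 0.15*1.0 + 0.10*0.5 = 0.30 + 0.225 + 0.15 + 0.05
score = table_score(metrics)

# Custom weights that favor recency.
recent_first = table_score(
    metrics,
    {"frequency": 0.25, "user_diversity": 0.25, "recency": 0.4, "consistency": 0.1},
)
```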
Table Grouping#
Identify tables frequently used together
Cohesion analysis for logical groupings
User reach metrics across groups
Group scoring with multiple factors
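One simple way to find tables that are frequently used together is to count how often each pair of tables appears in the same query. The sketch below shows that co-occurrence count; it assumes the table names have already been extracted from each query, and `co_occurrence` is a hypothetical helper rather than the recommender's actual grouping algorithm.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Sequence, Tuple


def co_occurrence(queries: Iterable[Sequence[str]]) -> "Counter[Tuple[str, str]]":
    """Count how many queries reference each unordered pair of tables."""
    pairs: "Counter[Tuple[str, str]]" = Counter()
    for tables in queries:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for a, b in combinations(sorted(set(tables)), 2):
            pairs[(a, b)] += 1
    return pairs


queries = [
    ["orders", "customers"],
    ["orders", "customers", "products"],
    ["orders", "products"],
    ["customers"],
]
pairs = co_occurrence(queries)
```

Pairs with high counts are candidates for the same logical group; a cohesion measure could then normalize these counts by how often each table appears overall.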
Output Formats#
Markdown: Human-readable reports with tables and formatting
JSON: Machine-readable format for automation and AI agents
Star ratings (1-5 stars) for quick assessment
Detailed metrics and query pattern analysis
CLI and Python API#
Command-line interface for quick analysis
Python API for programmatic integration
File-based input (CSV and JSON)
Configurable output directory and format
Future Modules#
The SDK’s modular architecture is designed to accommodate additional modules from different teams, for example:
Data profiling and statistics
Data lineage tracking
Data catalog integration
Additional data quality features
Custom team-specific functionality
Each module can leverage the common authentication and configuration infrastructure while maintaining independence.
Next Steps#
Learn about recent changes in Release Notes
Find answers to common questions in FAQ
Check Known Issues for current limitations
Start using the SDK with Authentication