Known Issues and Limitations#

This page documents known issues, limitations, and workarounds for the IBM watsonx.data intelligence SDK.

Current Known Issues#

Version 1.0.0#

Performance

  • Large Regex Patterns: Complex regex patterns in RegexCheck may impact performance on large datasets

    • Workaround: Simplify regex patterns or use FormatCheck for common formats

    • Status: Performance optimization planned for v0.6.0

  • PySpark Broadcast Variables: Very large metadata objects may cause memory issues when broadcast to workers

    • Workaround: Keep metadata objects minimal, avoid unnecessary column definitions

    • Status: Investigating optimization strategies

DataFrame Integration

  • Pandas Chunk Size: Default chunk size (1000) may not be optimal for all datasets

    • Workaround: Experiment with different chunk sizes based on your data and memory

    • Status: Working on adaptive chunk sizing

  • PySpark Struct Columns: Nested struct columns in validation results may be difficult to query

    • Workaround: Use explode() or select() to access nested fields

    • Status: Considering flattened output option

REST API Integration

  • Token Refresh: In rare cases, token refresh may fail during long-running operations

    • Workaround: Implement retry logic in your application

    • Status: Improving token management in v0.6.0

  • Rate Limiting: No built-in rate limiting for API calls

    • Workaround: Implement your own rate limiting if making many concurrent requests

    • Status: Planned for v0.7.0

Type Inference

  • Ambiguous Formats: Some date/time formats may be ambiguous (e.g., MM/DD vs DD/MM)

    • Workaround: Use explicit format specifications in FormatCheck

    • Status: Considering locale-aware format detection

Current Limitations#

Platform Support#

  • Operating Systems: Tested on Linux, macOS, and Windows. Some features may behave differently on Windows

  • Python Versions: Python 3.8+ supported, but some type hints require 3.10+

  • Architecture: Primarily tested on x86_64, limited testing on ARM

DataFrame Support#

  • Pandas Versions: Tested with Pandas 1.3.0+, some features may not work with older versions

  • PySpark Versions: Tested with PySpark 3.0.0+, older versions not supported

  • Dask: Not currently supported (planned for future release)

  • Polars: Not currently supported (under consideration)

Validation Checks#

  • Custom Checks: Limited documentation for creating complex custom checks

  • Check Combinations: No built-in support for combining multiple checks with AND/OR logic

  • Conditional Validation: No built-in support for conditional validation rules

REST API Integration#

  • API Versions: Tested with specific IBM Cloud Pak for Data versions, compatibility with older versions not guaranteed

  • Batch Operations: Limited support for batch operations (creating multiple checks at once)

  • Async Operations: No async/await support for API calls

Data Types#

  • Complex Types: Limited support for complex nested data types

  • Binary Data: No validation support for binary data types

  • JSON/XML: No built-in validation for JSON or XML structure

Workarounds and Best Practices#

Performance Optimization#

For large datasets:

# Use appropriate chunk sizes
validator.validate_dataframe(df, chunk_size=5000)  # Adjust based on memory

# Or use PySpark for very large datasets
from wxdi.dq_validator.integrations import SparkValidator
spark_validator = SparkValidator(metadata)
result_df = spark_validator.validate_dataframe(spark_df)

Memory Management#

For memory-constrained environments:

# Process in smaller chunks
for chunk in pd.read_csv('large_file.csv', chunksize=1000):
    results = validator.validate_dataframe(chunk)
    # Process results immediately
    process_results(results)

Error Handling#

Implement robust error handling:

from requests.exceptions import RequestException

try:
    provider = GlossaryProvider(config)
    terms = provider.get_glossary_terms()
except RequestException as e:
    # Handle network errors
    logger.error(f"API call failed: {e}")
    # Implement retry logic or fallback

Type Checking#

For better type safety:

# Use type hints and mypy
from typing import List
from wxdi.dq_validator import ValidationResult

def process_results(results: List[ValidationResult]) -> None:
    # Your code here with full type checking
    pass

Reporting Issues#

If you encounter an issue not listed here:

  1. Check GitHub Issues: Search existing issues to see if it’s already reported

  2. Verify Your Setup: Ensure you’re using supported versions of Python and dependencies

  3. Create Minimal Reproduction: Prepare a minimal code example that reproduces the issue

  4. Report the Issue: Open a new issue on GitHub with:

    • Clear description of the problem

    • Steps to reproduce

    • Expected vs actual behavior

    • Environment details (Python version, OS, SDK version)

    • Error messages and stack traces

Planned Improvements#

Version 0.6.0#

  • Performance optimizations for regex validation

  • Improved token management

  • Adaptive chunk sizing for Pandas

  • Enhanced error messages

Version 0.7.0#

  • Rate limiting for API calls

  • Batch operations support

  • Async/await support for API calls

  • Additional DataFrame backends (Dask consideration)

Version 1.0.0#

  • Stable API with backward compatibility guarantees

  • Comprehensive test coverage

  • Production-ready performance

  • Full documentation with advanced examples

Compatibility Matrix#

Tested Configurations#

Python

Pandas

PySpark

OS

Status

3.8

1.3.x

3.0.x

Linux

✓ Supported

3.9

1.4.x

3.1.x

Linux

✓ Supported

3.10

1.5.x

3.2.x

Linux/macOS

✓ Supported

3.11

2.0.x

3.3.x

Linux/macOS

✓ Supported

3.12

2.1.x

3.4.x

Linux/macOS/Windows

✓ Supported

Untested Configurations#

The following configurations may work but are not officially tested:

  • Python 3.7 (end of life)

  • Pandas < 1.3.0

  • PySpark < 3.0.0

  • ARM architecture

  • BSD operating systems

Getting Help#

If you’re experiencing issues:

We appreciate your patience as we continue to improve the SDK!