# Data Processing Overview
The Data Processing Framework is Python-based and enables the application of "transforms" to one or more input data files to produce one or more output data files. Various runtimes are available to execute the transforms, using a common shared methodology and mechanism to configure input and output across either local or S3-based storage.
The framework allows simple 1:1 transformation of (parquet) files, but also enables more complex transformations requiring coordination among transforming nodes. This might include operations such as de-duplication, merging, and splitting. The framework uses a plug-in model for the primary functions. The core transformation-specific classes/interfaces are as follows:
- AbstractBinaryTransform - a simple, easily implemented interface allowing the definition of transforms over arbitrary data held as a byte array. Additionally, a table transform interface is provided, allowing the definition of transforms operating on pyarrow tables.
- TransformConfiguration - defines the transform's short name, its implementation class, and its command line configuration parameters (a minimal sketch of these follows this list).
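As an illustration, a minimal sketch of a binary transform and its configuration might look like the following. The import path, method signatures, and the `uppercase_encoding` parameter are assumptions based on the descriptions above, not verbatim from the library; consult the source for the authoritative interfaces.

```python
from argparse import ArgumentParser, Namespace
from typing import Any

# Assumed import path and signatures, for illustration only.
from data_processing.transform import AbstractBinaryTransform, TransformConfiguration


class UppercaseTransform(AbstractBinaryTransform):
    """Toy 1:1 transform: upper-cases the text content of each input file."""

    def __init__(self, config: dict[str, Any]):
        super().__init__(config)
        self.encoding = config.get("uppercase_encoding", "utf-8")

    def transform_binary(
        self, file_name: str, byte_array: bytes
    ) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:
        # Return a list of (output bytes, file extension) pairs plus metadata.
        text = byte_array.decode(self.encoding).upper()
        return [(text.encode(self.encoding), ".txt")], {"files_processed": 1}


class UppercaseTransformConfiguration(TransformConfiguration):
    """Binds the short name, the implementation class, and CLI parameters."""

    def __init__(self):
        super().__init__(name="uppercase", transform_class=UppercaseTransform)

    def add_input_params(self, parser: ArgumentParser) -> None:
        parser.add_argument("--uppercase_encoding", type=str, default="utf-8")

    def apply_input_params(self, args: Namespace) -> bool:
        self.params["uppercase_encoding"] = args.uppercase_encoding
        return True
```

A table-oriented transform would follow the same shape, but implement a method taking and returning pyarrow tables instead of byte arrays.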
In support of running a transform over a set of input data in a runtime, the following classes/interfaces are provided:
- AbstractTransformLauncher - the central runtime interface expected to be implemented by each runtime (python, ray, spark, etc.) to apply a transform to a set of data. It is configured with a TransformRuntimeConfiguration and a DataAccessFactory instance (see below, and the sketch after this list).
- DataAccessFactory - used to configure the input and output data files to be processed, and creates the DataAccess instance (see below) according to the CLI parameters.
- TransformRuntimeConfiguration - captures the TransformConfiguration and runtime-specific configuration.
- DataAccess - the interface defining data I/O methods and selection. Implementations for local and S3 storage are provided.
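Putting the pieces together, a hedged sketch of launching the transform sketched earlier with a pure-Python runtime might look like this. The launcher and runtime-configuration class names, import paths, and the data-access CLI parameter names are assumptions here; the key point is that local versus S3 storage is selected purely by which configuration parameters are supplied.

```python
import sys

# Assumed class names and import paths, for illustration only.
from data_processing.runtime.pure_python import (
    PythonTransformLauncher,
    PythonTransformRuntimeConfiguration,
)

# Wrap the transform configuration sketched earlier with runtime specifics.
runtime_config = PythonTransformRuntimeConfiguration(
    transform_config=UppercaseTransformConfiguration()
)

# The launcher's DataAccessFactory reads CLI parameters and builds the
# matching DataAccess: local folders here, or S3 via parameters such as
# --data_s3_config and --data_s3_cred instead (names are assumptions).
sys.argv = [
    "uppercase_local.py",
    "--data_local_config",
    "{'input_folder': 'test-data/input', 'output_folder': 'output'}",
]

launcher = PythonTransformLauncher(runtime_config=runtime_config)
launcher.launch()
```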
To learn more, consider the following: