# Data Processing Overview
The Data Processing Framework is Python-based and enables the application of "transforms" to one or more input data files to produce one or more output data files. Various runtimes are available to execute the transforms, using a common shared methodology and mechanism to configure input and output across either local or S3-based storage.
The framework allows simple 1:1 transformation of (parquet) files, but also enables more complex transformations requiring coordination among transforming nodes. This might include operations such as de-duplication, merging, and splitting. The framework uses a plug-in model for the primary functions. The core transformation-specific classes/interfaces are as follows:
- AbstractBinaryTransform - a simple, easily implemented interface allowing the definition of transforms over arbitrary data held as a byte array. Additionally, a table transform interface is provided, allowing the definition of transforms operating on pyarrow tables.
- TransformConfiguration - defines the transform's short name, its implementation class, and its command line configuration parameters (a minimal sketch of these follows this list).
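As an illustration, a minimal sketch of a binary transform and its configuration might look like the following. The import path, method signatures, and the `uppercase_encoding` parameter are assumptions based on the descriptions above, not verbatim from the library; consult the source for the authoritative interfaces.

```python
from argparse import ArgumentParser, Namespace
from typing import Any

# Assumed import path and signatures, for illustration only.
from data_processing.transform import AbstractBinaryTransform, TransformConfiguration


class UppercaseTransform(AbstractBinaryTransform):
    """Toy 1:1 transform: upper-cases the text content of each input file."""

    def __init__(self, config: dict[str, Any]):
        super().__init__(config)
        self.encoding = config.get("uppercase_encoding", "utf-8")

    def transform_binary(
        self, file_name: str, byte_array: bytes
    ) -> tuple[list[tuple[bytes, str]], dict[str, Any]]:
        # Return a list of (output bytes, file extension) pairs plus metadata.
        text = byte_array.decode(self.encoding).upper()
        return [(text.encode(self.encoding), ".txt")], {"files_processed": 1}


class UppercaseTransformConfiguration(TransformConfiguration):
    """Binds the short name, the implementation class, and CLI parameters."""

    def __init__(self):
        super().__init__(name="uppercase", transform_class=UppercaseTransform)

    def add_input_params(self, parser: ArgumentParser) -> None:
        parser.add_argument("--uppercase_encoding", type=str, default="utf-8")

    def apply_input_params(self, args: Namespace) -> bool:
        self.params["uppercase_encoding"] = args.uppercase_encoding
        return True
```

A table-oriented transform would follow the same shape, but implement a method taking and returning pyarrow tables instead of byte arrays.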
In support of running a transform over a set of input data in a runtime, the following classes/interfaces are provided:
- AbstractTransformLauncher - the central runtime interface expected to be implemented by each runtime (python, ray, spark, etc.) to apply a transform to a set of data. It is configured with a TransformRuntimeConfiguration and a DataAccessFactory instance (see below, and the sketch after this list).
- DataAccessFactory - used to configure the input and output data files to be processed, and creates the DataAccess instance (see below) according to the CLI parameters.
- TransformRuntimeConfiguration - captures the TransformConfiguration and runtime-specific configuration.
- DataAccess - the interface defining data I/O methods and selection. Implementations for local and S3 storage are provided.
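Putting the pieces together, a hedged sketch of launching the transform sketched earlier with a pure-Python runtime might look like this. The launcher and runtime-configuration class names, import paths, and the data-access CLI parameter names are assumptions here; the key point is that local versus S3 storage is selected purely by which configuration parameters are supplied.

```python
import sys

# Assumed class names and import paths, for illustration only.
from data_processing.runtime.pure_python import (
    PythonTransformLauncher,
    PythonTransformRuntimeConfiguration,
)

# Wrap the transform configuration sketched earlier with runtime specifics.
runtime_config = PythonTransformRuntimeConfiguration(
    transform_config=UppercaseTransformConfiguration()
)

# The launcher's DataAccessFactory reads CLI parameters and builds the
# matching DataAccess: local folders here, or S3 via parameters such as
# --data_s3_config and --data_s3_cred instead (names are assumptions).
sys.argv = [
    "uppercase_local.py",
    "--data_local_config",
    "{'input_folder': 'test-data/input', 'output_folder': 'output'}",
]

launcher = PythonTransformLauncher(runtime_config=runtime_config)
launcher.launch()
```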
To learn more, consider the following: