Spark Framework
The Spark runtime implementation is loosely based on the ideas from here, here and here. Spark itself is used only for execution parallelization; all data access goes through the framework's data access layer, which preserves all of its implemented features. At the start of execution, the list of files to process is obtained (using the data access framework) and then split between Spark workers, which read the actual data, transform it and write it back. The implementation is based on Spark RDDs (for a comparison of the three Apache Spark APIs, RDDs, DataFrames, and Datasets, see this Databricks blog post). As defined by Databricks:
RDD was the primary user-facing API in Spark since its inception. At the core, an RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster that can be operated in parallel with a low-level API that offers transformations and actions.
In our implementation we are using pyspark.SparkContext.parallelize for running multiple transforms in parallel. We allow two options for specifying the number of partitions, which determines how many partitions the RDD is divided into (see here for an explanation of this parameter):

* If you specify a positive value for the parameter, Spark will attempt to evenly distribute the data from seq into that many partitions. For example, if you have a collection of 100 elements and you specify numSlices as 4, Spark will try to create 4 partitions with approximately 25 elements in each partition.
* If you do not specify this parameter, Spark will use a default value, which is typically determined by the cluster configuration or the available resources (number of workers).
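As an illustration (not part of the framework itself), the following standalone PySpark snippet shows how numSlices affects partitioning when parallelizing a collection of 100 elements, matching the example above:

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("numSlices-demo").setMaster("local[4]")
sc = SparkContext(conf=conf)

data = list(range(100))

# Explicit number of partitions: Spark distributes the 100 elements
# into 4 partitions of roughly 25 elements each.
rdd_explicit = sc.parallelize(data, numSlices=4)
print(rdd_explicit.getNumPartitions())                  # 4
print([len(p) for p in rdd_explicit.glom().collect()])  # [25, 25, 25, 25]

# No numSlices given: Spark falls back to its default parallelism,
# which is derived from the cluster configuration (here, local[4]).
rdd_default = sc.parallelize(data)
print(rdd_default.getNumPartitions())

sc.stop()
```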
Transforms
- SparkTransformRuntimeConfiguration allows a transform to be configured to run with PySpark. In addition to the features of its base class, TransformRuntimeConfiguration, this class provides a get_bcast_params() method for obtaining very large configuration settings. Before starting the transform execution, the Spark runtime broadcasts these settings to all the workers (see the sketch below).
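The following is a minimal sketch of this pattern, not the framework's actual code; apart from the get_bcast_params() name, the class, parameters and helper logic are hypothetical, and the exact signature of get_bcast_params() in the framework may differ. It shows a runtime configuration returning a large setting from get_bcast_params() and the driver broadcasting it with SparkContext.broadcast so every worker can read it:

```python
from pyspark import SparkConf, SparkContext


class MyTransformRuntimeConfiguration:
    """Hypothetical stand-in for SparkTransformRuntimeConfiguration."""

    def get_bcast_params(self) -> dict:
        # Return settings that are too large to ship with every task,
        # e.g. a lookup table loaded via the data access framework.
        return {"blocklist": {"example.com", "bad.org"}}


def run(runtime_config: MyTransformRuntimeConfiguration, files: list[str]) -> None:
    sc = SparkContext(conf=SparkConf().setAppName("bcast-demo").setMaster("local[2]"))
    try:
        # The runtime broadcasts the large settings once, before execution starts.
        bcast = sc.broadcast(runtime_config.get_bcast_params())

        def process(file_name: str) -> int:
            # Workers read the broadcast value instead of receiving a copy per task.
            blocklist = bcast.value["blocklist"]
            return 0 if file_name in blocklist else 1

        processed = sc.parallelize(files).map(process).sum()
        print(f"processed {processed} of {len(files)} files")
    finally:
        sc.stop()


if __name__ == "__main__":
    run(MyTransformRuntimeConfiguration(), ["a.parquet", "example.com", "b.parquet"])
```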
Runtime
Spark runtime extends the base framework with the following set of components:

* SparkTransformExecutionConfiguration allows configuring the Spark execution
* SparkTransformFileProcessor extends AbstractTransformFileProcessor to work with PySpark
* SparkTransformLauncher launches the PySpark runtime and executes a transform
* orchestrate function orchestrates the Spark based execution (sketched below)
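The following is a simplified, self-contained sketch of this orchestration flow, not the framework's actual orchestrate implementation; the helper names get_files_to_process and process_files are hypothetical. It mirrors the description above: the driver obtains the list of files, parallelizes it into partitions, and each worker reads, transforms and writes back its share of the files:

```python
from pyspark import SparkConf, SparkContext


def get_files_to_process() -> list[str]:
    # Hypothetical stand-in for the data access framework call that
    # lists the input files on the driver.
    return [f"input/part-{i:04d}.parquet" for i in range(100)]


def process_files(file_names) -> list[int]:
    # Hypothetical stand-in for SparkTransformFileProcessor: each worker
    # reads its files via the data access layer, applies the transform
    # and writes the results back. Here we just count the files.
    return [sum(1 for _ in file_names)]


def orchestrate(num_partitions: int = 4) -> None:
    sc = SparkContext(conf=SparkConf().setAppName("orchestrate-sketch").setMaster("local[4]"))
    try:
        # Driver side: list the files and split them across Spark workers.
        files = get_files_to_process()
        rdd = sc.parallelize(files, numSlices=num_partitions)

        # Worker side: each partition processes its subset of files.
        counts = rdd.mapPartitions(process_files).collect()
        print(f"processed {sum(counts)} files in {len(counts)} partitions")
    finally:
        sc.stop()


if __name__ == "__main__":
    orchestrate()
```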