Running a Transform Image
Here we address a simple use case of applying a single transform to a
set of parquet files.
We'll use one of the pre-built docker images from the repository
to process the data.
The examples below use the python runtime image, but they should also work for the ray or spark runtime images (use the -ray or -spark image name suffix instead of -python).
Getting an image
You may build the transform image locally in the repository, or you can use the pre-built images (latest, or 0.2.1 or later tags) on quay.io, found at https://quay.io/user/dataprep1.
Local Data - Python Runtime
To use an image to process local data, we mount the host input and output directories into the container. Any mount point can be used, but we will use /input and /output.
To process data in the /home/me/input directory and write it to the /home/me/output directory, we mount these directories into the container at the above mount points. For example, using the locally built noop transform:
docker run --rm \
-v /home/me/input:/input \
-v /home/me/output:/output \
noop-python:latest \
python noop_transform_python.py \
--data_local_config "{ \
'input_folder' : '/input', \
'output_folder' : '/output' \
}"
To use the pre-built quay.io image instead, substitute quay.io/dataprep1/data-prep-kit/noop-python:latest for noop-python:latest, as follows:
docker run --rm \
-v /home/me/input:/input \
-v /home/me/output:/output \
quay.io/dataprep1/data-prep-kit/noop-python:latest \
python noop_transform_python.py \
--data_local_config "{ \
'input_folder' : '/input', \
'output_folder' : '/output' \
}"
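The command above can also be wrapped in a small script so that the host directories and image are easy to change. The following is only a sketch: the variable names INPUT_DIR, OUTPUT_DIR, IMAGE, CONFIG and DRY_RUN and the function run_transform are hypothetical helpers introduced here, not part of the transform or the repository. By default it only prints the docker command instead of running it.

```shell
#!/bin/sh
# Hypothetical wrapper around the docker command shown above.
# All variable and function names here are illustrative assumptions.
INPUT_DIR="${INPUT_DIR:-/home/me/input}"      # host directory with input parquet files
OUTPUT_DIR="${OUTPUT_DIR:-/home/me/output}"   # host directory to receive the output
IMAGE="${IMAGE:-quay.io/dataprep1/data-prep-kit/noop-python:latest}"

# Same configuration as above, referring to the mount points inside the container.
CONFIG="{ 'input_folder' : '/input', 'output_folder' : '/output' }"

run_transform() {
  # Assemble the command as positional parameters so that quoting is preserved.
  set -- docker run --rm \
    -v "$INPUT_DIR:/input" \
    -v "$OUTPUT_DIR:/output" \
    "$IMAGE" \
    python noop_transform_python.py \
    --data_local_config "$CONFIG"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"   # dry run: print the command only
  else
    "$@"        # requires Docker and access to the image
  fi
}

run_transform
```

Setting DRY_RUN=0 in the environment runs the container for real; leaving it unset is a safe way to check the mounts and configuration first.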
Local Data - Ray Runtime
To use the ray runtime, we must:
1. Switch to using the ray-based image noop-ray:latest
2. Use the ray runtime python main() defined in noop_transform_ray.py
For example, using the quay.io image:
docker run --rm \
-v /home/me/input:/input \
-v /home/me/output:/output \
quay.io/dataprep1/data-prep-kit/noop-ray:latest \
python noop_transform_ray.py \
--data_local_config "{ \
'input_folder' : '/input', \
'output_folder' : '/output' \
}"
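Since switching runtimes changes both the image name and the script name, the two can be derived together. The following is a minimal sketch, assuming a hypothetical helper function select_runtime and variables IMAGE and SCRIPT; it covers only the python and ray names that appear in this document and rejects anything else.

```shell
# Hypothetical helper: set IMAGE and SCRIPT for a given runtime.
# Only the python and ray names shown in this document are accepted.
select_runtime() {
  case "$1" in
    python|ray)
      IMAGE="quay.io/dataprep1/data-prep-kit/noop-$1:latest"
      SCRIPT="noop_transform_$1.py"
      ;;
    *)
      echo "unsupported runtime: $1" >&2
      return 1
      ;;
  esac
}

select_runtime ray
echo "$IMAGE $SCRIPT"
```

The resulting values can then replace the image name and script name in the docker run command above.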
S3-located Data - Python Runtime
When processing data located in S3 buckets, one can use the same image and specify different --data_s3_* configuration as follows: