Fuzzy Dedup
Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE setup.
Summary
This project wraps the Fuzzy Dedup transform with a Ray runtime.
Configuration and command line Options
Fuzzy Dedup configuration and command line options are the same as for the base Python transform.
Running
Launched Command Line Options
When running the transform with the Ray launcher (i.e. TransformLauncher), the set of Ray launcher options is available in addition to the options available to the transform as defined here.
Running the samples
To run the samples, first create a virtual environment using the project's make target, then activate it and run the sample scripts from the src directory:
source venv/bin/activate
cd src
python signature_calc_local_ray.py
python cluster_analysis_local_ray.py
python get_duplicate_list_local_ray.py
python data_cleaning_local_ray.py
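The four sample scripts above can also be driven from a single Python program. The following is a minimal sketch, not part of this project: the helper names `build_commands` and `run_samples` are hypothetical, and it assumes that the virtual environment is already active, that it is started from the directory containing `src`, and that the scripts are meant to run in the order listed.

```python
import subprocess
import sys
from pathlib import Path

# Sample scripts named in this README, in the order they are listed above.
SAMPLE_SCRIPTS = [
    "signature_calc_local_ray.py",
    "cluster_analysis_local_ray.py",
    "get_duplicate_list_local_ray.py",
    "data_cleaning_local_ray.py",
]


def build_commands(python_exe: str = sys.executable) -> list[list[str]]:
    """Return one command per sample script, preserving the listed order."""
    return [[python_exe, script] for script in SAMPLE_SCRIPTS]


def run_samples(src_dir: str = "src") -> None:
    """Run each sample from src_dir (assumed location), stopping at the first failure."""
    src = Path(src_dir)
    for cmd in build_commands():
        print("running:", " ".join(cmd))
        # check=True raises CalledProcessError if a script exits with a non-zero code.
        subprocess.run(cmd, cwd=src, check=True)
```

Calling `run_samples()` from an interpreter started inside the activated virtual environment would then run the scripts one after another. Stopping at the first failure is a deliberate choice here, on the assumption that each later sample consumes the output of the previous one.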
Transforming data using the transform image
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.
Testing
For testing fuzzy deduplication in a Ray runtime, use the following make targets. To launch integration tests for all the component transforms of fuzzy dedup (signature calculation, cluster analysis, get duplicate list and data cleaning) use:
To test the creation of the Docker image for fuzzy dedup transform and the capability to run a local program inside that image, use: