Testing transforms with S3

To test transforms with S3 we use Minio, which can be installed on Linux, macOS, and Windows. The instructions here assume macOS; refer to the Minio documentation for other platforms.

Installing Minio

The simplest way to install Minio on Mac is using Homebrew. Use the following command:

brew install minio/stable/minio

In addition to the Minio server, install the latest stable MinIO client (mc) using:

brew install minio/stable/mc

Now you can start the Minio server using the following command:

minio server start

Once it starts, you can connect to the server UI at http://localhost:9000. The default user name and password are both minioadmin.
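Before moving on, it can help to confirm that the server is actually answering requests. The sketch below polls MinIO's liveness endpoint (/minio/health/live) with curl. The function name wait_for_minio and its three optional arguments are illustrative and not part of Minio or this repository; it assumes curl is installed and the server listens on port 9000.

```shell
#!/bin/sh
# Poll the MinIO liveness endpoint until it answers or the attempts run out.
# Arguments (all optional, names are illustrative):
#   $1 = health URL, $2 = maximum attempts, $3 = seconds to sleep between attempts
wait_for_minio() {
  local url="${1:-http://localhost:9000/minio/health/live}"
  local attempts="${2:-10}"
  local delay="${3:-1}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress output, -f turns HTTP errors into a non-zero exit
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_minio && echo "Minio is up"` prints the message only once the server responds.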

Populating Minio with testing data

Populating Minio server with test data can be done using mc. First configure mc to work with the local Minio server:

mc alias set local http://127.0.0.1:9000 minioadmin minioadmin

This sets up an mc alias named local that points to the local Minio server instance. We can now run mc commands against this alias to populate the server.

First test the connection to the newly added MinIO deployment using the mc admin info command:

mc admin info local

To copy the data to Minio, you first need to create a bucket:

mc mb local/test
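If you rerun this setup, creating a bucket that already exists reports an error. A small wrapper along the following lines makes the step repeatable by passing the --ignore-existing flag to mc mb. The MC variable and the ensure_bucket function are hypothetical helpers introduced here for illustration; setting MC=echo gives a dry run that only prints the command.

```shell
#!/bin/sh
# Client binary to call; override for a dry run, e.g. MC=echo
MC="${MC:-mc}"

# Create the bucket given as $1 unless it already exists.
ensure_bucket() {
  # --ignore-existing makes `mc mb` succeed when the bucket is already there
  "$MC" mb --ignore-existing "$1"
}
```

Calling `ensure_bucket local/test` then has the same effect as the command above on a fresh server and is harmless on later runs.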

Once the bucket is created, you can copy the test files (assuming you are in the transforms directory) using:

mc cp --recursive tools/ingest2parquet/test-data/input/ local/test/ingest2parquet/input
mc cp --recursive code/code_quality/test-data/input/ local/test/code_quality/input
mc cp --recursive code/proglang_select/test-data/input/ local/test/proglang_select/input
mc cp --recursive code/proglang_select/test-data/languages/ local/test/proglang_select/languages
mc cp --recursive code/malware/test-data/input/ local/test/malware/input

mc cp --recursive language/doc_quality/test-data/input/ local/test/doc_quality/input
mc cp --recursive language/lang_id/ray/test-data/input/ local/test/lang_id/input

mc cp --recursive universal/blocklist/test-data/input/ local/test/blocklist/input
mc cp --recursive universal/blocklist/test-data/domains/ local/test/blocklist/domains
mc cp --recursive universal/doc_id/test-data/input/ local/test/doc_id/input
mc cp --recursive universal/ededup/test-data/input/ local/test/ededup/input
mc cp --recursive universal/fdedup/test-data/input/ local/test/fdedup/input
mc cp --recursive universal/filter/test-data/input/ local/test/filter/input
mc cp --recursive universal/noop/test-data/input/ local/test/noop/input
mc cp --recursive universal/resize/test-data/input/ local/test/resize/input
mc cp --recursive universal/tokenization/test-data/ds01/input/ local/test/tokenization/ds01/input
mc cp --recursive universal/tokenization/test-data/ds02/input/ local/test/tokenization/ds02/input

Note that once the data is copied, Minio stores it on the local file system, so you do not need to copy it again after a cluster restart.
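The copy commands above follow a regular pattern: the second path component names the transform, and everything after test-data/ is kept as the path inside the bucket. As a rough sketch, that rule can be written as a helper so new transforms do not need a hand-written line. The names MC, dest_for, and copy_test_data are illustrative only, and the mapping is inferred from the commands listed here rather than taken from any project script.

```shell
#!/bin/sh
# Client binary to call; override for a dry run, e.g. MC=echo
MC="${MC:-mc}"

# Map a source directory to its bucket path, following the commands above:
#   <group>/<transform>/.../test-data/<rest>/  ->  local/test/<transform>/<rest>
dest_for() {
  local src="${1%/}"                       # drop the trailing slash
  local transform rest
  transform="$(printf '%s\n' "$src" | cut -d/ -f2)"
  rest="${src#*test-data/}"                # keep what follows test-data/
  printf 'local/test/%s/%s\n' "$transform" "$rest"
}

# Copy every source directory passed as an argument to its derived destination.
copy_test_data() {
  local src
  for src in "$@"; do
    "$MC" cp --recursive "$src" "$(dest_for "$src")"
  done
}
```

For instance, `MC=echo copy_test_data universal/noop/test-data/input/` prints the mc cp command it would run, and dropping `MC=echo` performs the copy.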

Creating access and secret key for Minio access

The last step is to create the access and secret keys that applications use to connect to Minio. The following command:

mc admin user svcacct add --access-key "localminioaccesskey" --secret-key "localminiosecretkey" local minioadmin

creates both the access key and the secret key for use by applications.
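One way to check that the new keys work is to register a second mc alias that authenticates with them and then list the test bucket. This is only a sketch: the alias name localsvc, the MC variable, and the check_service_account function are hypothetical, and it assumes the test bucket created earlier exists.

```shell
#!/bin/sh
# Client binary to call; override for a dry run, e.g. MC=echo
MC="${MC:-mc}"

# Register an alias that uses the service-account keys, then list the bucket.
check_service_account() {
  # "localsvc" is an illustrative alias name, separate from the admin alias "local"
  "$MC" alias set localsvc http://127.0.0.1:9000 localminioaccesskey localminiosecretkey &&
    "$MC" ls localsvc/test
}
```

If `check_service_account` lists the folders copied earlier, applications configured with the same endpoint and keys should be able to read the data as well.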