Document ID Python Annotator
Please see the transform project conventions for details on general project conventions, transform configuration, testing, and IDE setup.
Building
A Dockerfile is provided for building the Docker image. You can use
Configuration and command line Options
The DocIDTransform configuration dictionary accepts the following keys:
- doc_column - specifies the name of the column containing the document (required for ID generation)
- hash_column - specifies the name of the column created to hold the string document ID; if None, the string ID is not generated
- int_id_column - specifies the name of the column created to hold the integer document ID; if None, the integer ID is not generated
- start_id - the ID from which the ID generator starts
At least one of hash_column or int_id_column must be specified.
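The configuration keys above can be illustrated with a minimal sketch. It is not the transform's actual implementation: the helper name `add_doc_ids`, the use of SHA-256 for the string ID, the default `start_id` of 0, and the plain Python lists standing in for table columns are all assumptions made for illustration.

```python
import hashlib
from typing import Any, Optional


def add_doc_ids(docs: list[str], config: dict[str, Any]) -> dict[str, list]:
    """Hypothetical helper: build the ID columns requested by the configuration."""
    hash_column: Optional[str] = config.get("hash_column")
    int_id_column: Optional[str] = config.get("int_id_column")
    # Mirror the documented rule: at least one ID column must be requested.
    if hash_column is None and int_id_column is None:
        raise ValueError("At least one of hash_column or int_id_column must be specified")

    columns: dict[str, list] = {config["doc_column"]: docs}
    if hash_column is not None:
        # Assumption: SHA-256 hex digest of the UTF-8 document text.
        columns[hash_column] = [hashlib.sha256(d.encode("utf-8")).hexdigest() for d in docs]
    if int_id_column is not None:
        # Assumption: sequential integers beginning at start_id (default 0 here).
        start = config.get("start_id", 0)
        columns[int_id_column] = list(range(start, start + len(docs)))
    return columns


config = {
    "doc_column": "contents",
    "hash_column": "doc_hash",
    "int_id_column": "doc_id",
    "start_id": 5,
}
result = add_doc_ids(["first document", "second document"], config)
print(result["doc_id"])  # [5, 6]
```

Setting either `hash_column` or `int_id_column` to None in this sketch skips that column, while setting both to None raises an error, matching the rule stated above.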
Running
Launched Command Line Options
When running the transform with the Ray launcher (i.e. TransformLauncher), the following command line arguments are available in addition to the options provided by the ray launcher.
--doc_id_doc_column DOC_ID_DOC_COLUMN
    doc column name
--doc_id_hash_column DOC_ID_HASH_COLUMN
    Compute document hash and place in the given named column
--doc_id_int_column DOC_ID_INT_COLUMN
    Compute unique integer id and place in the given named column
--doc_id_start_id DOC_ID_START_ID
    starting integer id
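To show how these flags might be parsed, here is a stand-alone `argparse` sketch. It only mimics the four options listed above; the real launcher builds its own parser, and the argument types, the defaults, and the sample values passed to `parse_args` are assumptions.

```python
import argparse

# Illustrative stand-in for the doc_id options; not the launcher's own parser.
parser = argparse.ArgumentParser(description="doc_id transform options (sketch)")
parser.add_argument("--doc_id_doc_column", type=str, help="doc column name")
parser.add_argument(
    "--doc_id_hash_column",
    type=str,
    default=None,  # assumed default: no hash column
    help="Compute document hash and place in the given named column",
)
parser.add_argument(
    "--doc_id_int_column",
    type=str,
    default=None,  # assumed default: no integer ID column
    help="Compute unique integer id and place in the given named column",
)
parser.add_argument(
    "--doc_id_start_id",
    type=int,
    default=0,  # assumed default starting ID
    help="starting integer id",
)

# Sample invocation with hypothetical column names.
args = parser.parse_args(
    [
        "--doc_id_doc_column", "contents",
        "--doc_id_hash_column", "doc_hash",
        "--doc_id_start_id", "100",
    ]
)
print(args.doc_id_doc_column, args.doc_id_hash_column, args.doc_id_int_column, args.doc_id_start_id)
```

In this sketch, omitting `--doc_id_int_column` leaves it as None, which corresponds to the integer ID not being generated.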
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.