Document ID Python Annotator

See the transform project conventions for details on general project conventions, transform configuration, testing, and IDE setup.

Building

A Dockerfile is provided for building the Docker image. To build it, run:

make build 

Configuration and command line Options

The DocIDTransform configuration dictionary accepts the following keys:

  • doc_column - specifies name of the column containing the document (required for ID generation)
  • hash_column - specifies the name of the column created to hold the string document ID; if None, this ID is not generated
  • int_id_column - specifies the name of the column created to hold the integer document ID; if None, this ID is not generated
  • start_id - the integer ID from which the ID generator starts

At least one of hash_column or int_id_column must be specified.
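
As an illustration of how these keys interact, here is a minimal sketch in plain Python. It is not the DocIDTransform implementation: the helper names `validate_config` and `annotate` are hypothetical, the use of SHA-256 for the string ID is an assumption (this page does not name the hash function), and the default `start_id` of 0 is also an assumption.

```python
import hashlib
from typing import Any


def validate_config(config: dict[str, Any]) -> None:
    """Raise ValueError unless at least one ID column is requested."""
    if config.get("hash_column") is None and config.get("int_id_column") is None:
        raise ValueError("At least one of hash_column or int_id_column must be specified")


def annotate(rows: list[dict[str, Any]], config: dict[str, Any]) -> list[dict[str, Any]]:
    """Return copies of the rows with the requested ID columns added."""
    validate_config(config)
    doc_column = config["doc_column"]
    hash_column = config.get("hash_column")
    int_id_column = config.get("int_id_column")
    next_id = config.get("start_id", 0)  # default of 0 is an assumption

    annotated = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if hash_column is not None:
            # SHA-256 is an assumption; the source does not name the hash function.
            digest = hashlib.sha256(row[doc_column].encode("utf-8")).hexdigest()
            new_row[hash_column] = digest
        if int_id_column is not None:
            # Sequential integers starting at start_id.
            new_row[int_id_column] = next_id
            next_id += 1
        annotated.append(new_row)
    return annotated
```

For example, calling `annotate` with `{"doc_column": "contents", "int_id_column": "doc_id", "start_id": 5}` on two rows would add `doc_id` values 5 and 6 and no hash column, since `hash_column` is left unset.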

Running

Launched Command Line Options

When running the transform with the Ray launcher (i.e. TransformLauncher), the following command line arguments are available in addition to the launcher's own options.

  --doc_id_doc_column DOC_ID_DOC_COLUMN
                        doc column name
  --doc_id_hash_column DOC_ID_HASH_COLUMN
                        Compute document hash and place in the given named column
  --doc_id_int_column DOC_ID_INT_COLUMN
                        Compute unique integer id and place in the given named column
  --doc_id_start_id DOC_ID_START_ID
                        starting integer id

These correspond to the configuration keys described above.
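
The correspondence between flags and configuration keys can be sketched with `argparse`. The flag names and help strings are taken from the listing above; the `FLAG_TO_KEY` mapping, the helper names, and the default values are assumptions made for illustration, and the real launcher may wire these up differently.

```python
import argparse
from typing import Any

# Assumed mapping from command line flag (without the leading "--")
# to the configuration key it populates.
FLAG_TO_KEY = {
    "doc_id_doc_column": "doc_column",
    "doc_id_hash_column": "hash_column",
    "doc_id_int_column": "int_id_column",
    "doc_id_start_id": "start_id",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the doc_id flags listed above."""
    parser = argparse.ArgumentParser(description="doc_id options (illustrative)")
    parser.add_argument("--doc_id_doc_column", type=str, default=None,
                        help="doc column name")
    parser.add_argument("--doc_id_hash_column", type=str, default=None,
                        help="Compute document hash and place in the given named column")
    parser.add_argument("--doc_id_int_column", type=str, default=None,
                        help="Compute unique integer id and place in the given named column")
    # A default of 0 is an assumption, not documented on this page.
    parser.add_argument("--doc_id_start_id", type=int, default=0,
                        help="starting integer id")
    return parser


def args_to_config(argv: list[str]) -> dict[str, Any]:
    """Parse argv and rename each flag to its configuration key."""
    parsed = vars(build_parser().parse_args(argv))
    return {FLAG_TO_KEY[name]: value for name, value in parsed.items()}
```

With this sketch, `args_to_config(["--doc_id_doc_column", "contents", "--doc_id_int_column", "doc_id"])` yields a dictionary whose `hash_column` is None, so only the integer ID would be requested.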

To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.