Distributed tokenization module for data sets using any Hugging Face compatible tokenizer.
# Data Tokenization
Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.
## Summary
The data tokenization transform operates by converting a (non-empty) input table into an output table
using a pre-trained tokenizer. The input table is required to have a minimum of two columns,
named `document_id` and `contents` by default. However, alternate column names can be specified using
`--tkn_doc_id_column` for the document id and `--tkn_doc_content_column` for the document contents.
The values in the `document_id` column must be unique across the dataset,
while the `contents` column stores the corresponding document content. To run the example demonstrations
in this directory, a machine with 64GiB of RAM is recommended.
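For reference, the sketch below (illustrative only; it assumes `pyarrow` just for building the example table) shows a minimal input table with the two default columns:

```python
# Illustrative only: a minimal input table with the default column names.
# The column names can be remapped with --tkn_doc_id_column and
# --tkn_doc_content_column.
import pyarrow as pa

input_table = pa.table({
    "document_id": ["doc-001", "doc-002"],  # values must be unique across the dataset
    "contents": ["First document text.", "Second document text."],
})
print(input_table.schema)
```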
To specify a pre-trained tokenizer, use the `--tkn_tokenizer` parameter.
This parameter accepts the name of a tokenizer that can be downloaded from Hugging Face,
such as `hf-internal-testing/llama-tokenizer`, `bigcode/starcoder`, or any other tokenizer compatible
with the Hugging Face `AutoTokenizer` library. Additionally, you can use the `--tkn_tokenizer_args` parameter
to pass extra arguments to the chosen tokenizer.
For instance, when loading a Hugging Face tokenizer like `bigcode/starcoder`, which requires an access token,
you can specify `use_auth_token=<your token>` in `--tkn_tokenizer_args`.
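As a rough illustration (not the transform's actual implementation), the snippet below shows how a tokenizer name plus comma-separated `key=value` arguments of the kind passed via `--tkn_tokenizer_args` could be turned into a Hugging Face `AutoTokenizer`; the `load_tokenizer` helper is hypothetical:

```python
# Hypothetical sketch: build an AutoTokenizer from a name and a
# "key1=value1,key2=value2" argument string (e.g. cache_dir=/tmp/hf).
from transformers import AutoTokenizer

def load_tokenizer(name: str, args: str = ""):
    kwargs = dict(kv.split("=", 1) for kv in args.split(",") if kv)
    return AutoTokenizer.from_pretrained(name, **kwargs)

tokenizer = load_tokenizer("hf-internal-testing/llama-tokenizer")
print(tokenizer("Hello world")["input_ids"])
```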
The tokenization transform uses the specified tokenizer to tokenize each row of the input table,
assuming each row represents a document, and saves the result to a corresponding row in the output table.
The output table generally consists of four columns: `tokens`, `document_id`, `document_length`, and `token_count`.
The `tokens` column stores the sequence of token IDs generated by the tokenizer during the document tokenization process.
The `document_id` column (or the designated name specified in `--tkn_doc_id_column`) contains the document ID,
while `document_length` and `token_count` respectively record the length of the document and the total count of generated tokens.
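The following sketch mirrors that per-document bookkeeping; the `tokenize_row` function is illustrative only and is not the transform's code:

```python
# Illustrative per-document tokenization producing the four output fields
# described above: tokens, document_id, document_length, token_count.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

def tokenize_row(document_id: str, contents: str) -> dict:
    token_ids = tokenizer(contents)["input_ids"]
    return {
        "tokens": token_ids,
        "document_id": document_id,
        "document_length": len(contents),
        "token_count": len(token_ids),
    }

print(tokenize_row("doc-001", "A short example document."))
```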
During tokenization, the tokenizer will disregard empty documents (rows) in the input table,
as well as documents that yield no tokens or encounter failure during tokenization.
The count of such documents will be stored in the `num_empty_rows` field of the `metadata` file.
In certain cases, the tokenization process of some tokenizers may be sluggish,
particularly when handling lengthy documents containing millions of characters.
To address this, you can use the `--tkn_chunk_size` parameter to define the length of the chunks to tokenize at a given time.
For English text (`en`), a chunk size of `20,000` characters is recommended, roughly equivalent to 15 pages of text.
The tokenizer will then tokenize each chunk separately and concatenate the resulting token IDs.
By default, the value of `--tkn_chunk_size` is `0`, indicating that each document is tokenized as a whole, regardless of its length.
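A minimal sketch of this chunking behaviour, assuming chunks are extended to the next whitespace so that words are not split (the transform's actual splitting logic may differ):

```python
# Illustrative chunked tokenization: split the text into chunks of roughly
# `chunk_size` characters (rounded up to a word boundary), tokenize each
# chunk, and concatenate the resulting token IDs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

def tokenize_in_chunks(text: str, chunk_size: int = 20_000) -> list[int]:
    if chunk_size <= 0:  # 0 (the default) tokenizes the whole document at once
        return tokenizer(text)["input_ids"]
    token_ids, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        while end < len(text) and not text[end].isspace():
            end += 1  # extend to the next whitespace so words are not split
        token_ids.extend(tokenizer(text[start:end], add_special_tokens=False)["input_ids"])
        start = end
    return token_ids

print(len(tokenize_in_chunks("lorem ipsum " * 5000, chunk_size=20_000)))
```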
## Running
### CLI Options
The following command line arguments are available in addition to the options provided by the python launcher.
* `--tkn_tokenizer TKN_TOKENIZER`: Tokenizer used for tokenization. It can also be a path to a pre-trained tokenizer. By default, `hf-internal-testing/llama-tokenizer` from Hugging Face is used.
* `--tkn_tokenizer_args TKN_TOKENIZER_ARGS`: Arguments for the tokenizer. For example, `cache_dir=/tmp/hf,use_auth_token=Your_HF_authentication_token` could be arguments for the tokenizer `bigcode/starcoder` from Hugging Face.
* `--tkn_doc_id_column TKN_DOC_ID_COLUMN`: Column containing the document id; its values should be unique across the dataset.
* `--tkn_doc_content_column TKN_DOC_CONTENT_COLUMN`: Column containing the document content.
* `--tkn_text_lang TKN_TEXT_LANG`: Language of the text content, used for better text splitting if needed.
* `--tkn_chunk_size TKN_CHUNK_SIZE`: Specify a value >0 to tokenize each row/document in chunks of that many characters (rounded to word boundaries).
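A hypothetical invocation combining several of the flags above is sketched below. The entry point (`src/tokenization_transform_python.py`) is taken from the sample targets in the next section; the launcher's own options (e.g. input/output configuration) are omitted here and must be supplied as described in the launcher documentation.

```python
# Hypothetical command line for the Python runtime; launcher-specific
# input/output options are intentionally omitted.
import subprocess

subprocess.run(
    [
        "python", "src/tokenization_transform_python.py",
        "--tkn_tokenizer", "hf-internal-testing/llama-tokenizer",
        "--tkn_doc_id_column", "document_id",
        "--tkn_doc_content_column", "contents",
        "--tkn_chunk_size", "20000",
    ],
    check=True,
)
```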
### Running the samples
To run the samples, use the following `make` targets:

* `run-cli-sample` - runs `src/tokenization_transform_python.py` using command line args
* `run-local-sample` - runs `src/tokenization_local_python.py`

These targets will activate the virtual environment and set up any configuration needed.
Use the `-n` option of `make` to see the detail of what is done to run the sample, for example `make -n run-cli-sample`.
After a sample completes, inspect its output location to see the results of the transform.
### Transforming data using the transform image
To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.