Training a Tokenizer from Scratch

Let us now walk through a short tutorial on training a tokenizer from scratch. All the programs are run from the root folder of the repository.

To train a tokenizer, we need a corpus. For this tutorial, a sample corpus is provided in the following folder:

$ ls demo/data/lm/
english_sample.txt

A snippet of the corpus is shown below:

$ head demo/data/lm/english_sample.txt
The Project Gutenberg eBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

$ wc demo/data/lm/english_sample.txt
2136   10152   56796 demo/data/lm/english_sample.txt

This text is extracted from the play Romeo and Juliet by William Shakespeare, taken from Project Gutenberg (https://www.gutenberg.org/cache/epub/1513/pg1513.txt).

We will train a WordPiece tokenizer with a vocabulary size of around 500. The small vocabulary size is chosen because the corpus itself is small.

$ python src/tokenizer/train_tokenizer.py \
    --input_file demo/data/lm/english_sample.txt \
    --name demo/model/tokenizer/ \
    --tokenizer_type wordpiece \
    --vocab_size 500

[00:00:00] Pre-processing files (0 Mo)   ██████████████████████████          100%
[00:00:00] Tokenize words                ██████████████████████████   4252 / 4252
[00:00:00] Count pairs                   ██████████████████████████   4252 / 4252
[00:00:00] Compute merges                ██████████████████████████    387 /  387
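The progress output above is characteristic of the Hugging Face tokenizers library, which the training script most likely wraps. As a rough, hypothetical sketch (the actual script may differ in its pre-tokenizer and special tokens), equivalent training code would look like this:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build an (untrained) WordPiece tokenizer with a simple whitespace pre-tokenizer.
# The special tokens below are an assumption, not necessarily what train_tokenizer.py uses.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train on the sample corpus with a small vocabulary
trainer = WordPieceTrainer(
    vocab_size=500,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["demo/data/lm/english_sample.txt"], trainer=trainer)

# Save everything into a single tokenizer.json file
tokenizer.save("demo/model/tokenizer/tokenizer.json")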

The following file will be created inside the demo/model/tokenizer/ folder:

$ ls demo/model/tokenizer/
tokenizer.json
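
At this point the trained tokenizer can already be loaded and tried out directly through the tokenizers library (the sample sentence below is only an illustration):

from tokenizers import Tokenizer

# Load the trained tokenizer from the saved tokenizer.json
tokenizer = Tokenizer.from_file("demo/model/tokenizer/tokenizer.json")

# Encode a sample sentence and inspect the resulting subword tokens
encoding = tokenizer.encode("O Romeo, Romeo! wherefore art thou Romeo?")
print(encoding.tokens)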

Creating the Model Configuration File

By default, the train_tokenizer.py script doesn't create the model configuration files. The configuration file is required to load the tokenizer with AutoTokenizer.from_pretrained(). We now use the create_config.py script to create the configuration file.

$ python create_config.py \
    --path demo/model/tokenizer/ \
    --type gpt2
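
Once the configuration files have been written, the tokenizer can be loaded through the transformers library in the usual way (the sample sentence is again only an illustration):

from transformers import AutoTokenizer

# Load the tokenizer from the local folder containing tokenizer.json and the config files
tokenizer = AutoTokenizer.from_pretrained("demo/model/tokenizer/")

# Tokenize a sample sentence with the freshly trained tokenizer
print(tokenizer.tokenize("O Romeo, Romeo! wherefore art thou Romeo?"))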