Training a Tokenizer from Scratch

Let us now walk through a short tutorial on training a tokenizer from scratch. All the programs are run from the root folder of the repository.

To train a tokenizer, we need a corpus. For this tutorial, a sample corpus is provided in the following folder:

$ ls demo/data/lm/
english_sample.txt

A snippet of the corpus is shown below:

$ head demo/data/lm/english_sample.txt
The Project Gutenberg eBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

$ wc demo/data/lm/english_sample.txt
2136   10152   56796 demo/data/lm/english_sample.txt

This text is extracted from the play Romeo and Juliet by William Shakespeare, taken from Project Gutenberg (https://www.gutenberg.org/cache/epub/1513/pg1513.txt).

We will train a WordPiece tokenizer with a vocabulary size of around 500. The small vocabulary size is chosen because the corpus itself is small.

$ python src/tokenizer/train_tokenizer.py \
    --input_file demo/data/lm/english_sample.txt \
    --name demo/model/tokenizer/ \
    --tokenizer_type wordpiece \
    --vocab_size 500

[00:00:00] Pre-processing files (0 Mo)   ██████████████████████████          100%
[00:00:00] Tokenize words                ██████████████████████████   4252 / 4252
[00:00:00] Count pairs                   ██████████████████████████   4252 / 4252
[00:00:00] Compute merges                ██████████████████████████    387 /  387
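The progress output above is characteristic of the Hugging Face tokenizers library, which the training script most likely wraps. As a rough, hypothetical sketch (the actual script may differ in its pre-tokenizer and special tokens), equivalent training code would look like this:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Build an (untrained) WordPiece tokenizer with a simple whitespace pre-tokenizer.
# The special tokens below are an assumption, not necessarily what train_tokenizer.py uses.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train on the sample corpus with a small vocabulary
trainer = WordPieceTrainer(
    vocab_size=500,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["demo/data/lm/english_sample.txt"], trainer=trainer)

# Save everything into a single tokenizer.json file
tokenizer.save("demo/model/tokenizer/tokenizer.json")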

The following file will be created inside the demo/model/tokenizer/ folder:

$ ls demo/model/tokenizer/
tokenizer.json
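
At this point the trained tokenizer can already be loaded and tried out directly through the tokenizers library (the sample sentence below is only an illustration):

from tokenizers import Tokenizer

# Load the trained tokenizer from the saved tokenizer.json
tokenizer = Tokenizer.from_file("demo/model/tokenizer/tokenizer.json")

# Encode a sample sentence and inspect the resulting subword tokens
encoding = tokenizer.encode("O Romeo, Romeo! wherefore art thou Romeo?")
print(encoding.tokens)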

Creating the Model Configuration File

By default, the train_tokenizer.py script doesn't create the model configuration files. The configuration file is required to load the tokenizer with AutoTokenizer.from_pretrained(). We now use the create_config.py script to create the configuration file.

$ python create_config.py \
    --path demo/model/tokenizer/ \
    --type gpt2
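
Once the configuration files have been written, the tokenizer can be loaded through the transformers library in the usual way (the sample sentence is again only an illustration):

from transformers import AutoTokenizer

# Load the tokenizer from the local folder containing tokenizer.json and the config files
tokenizer = AutoTokenizer.from_pretrained("demo/model/tokenizer/")

# Tokenize a sample sentence with the freshly trained tokenizer
print(tokenizer.tokenize("O Romeo, Romeo! wherefore art thou Romeo?"))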