Training a Tokenizer from Scratch
=================================

Let us now walk through a short tutorial on training a tokenizer from scratch. All the programs are run from the root folder of the repository.

To train a tokenizer we need a corpus. For this tutorial, a sample corpus is provided in the following folder:

.. code-block:: console
   :linenos:

   $ ls demo/data/lm/
   english_sample.txt

A snippet of the corpus is shown below:

.. code-block:: console
   :linenos:

   $ head demo/data/lm/english_sample.txt
   The Project Gutenberg eBook of Romeo and Juliet, by William Shakespeare

   This eBook is for the use of anyone anywhere in the United States and
   most other parts of the world at no cost and with almost no restrictions
   whatsoever. You may copy it, give it away or re-use it under the terms
   of the Project Gutenberg License included with this eBook or online at
   www.gutenberg.org. If you are not located in the United States, you
   will have to check the laws of the country where you are located before
   using this eBook.

   $ wc demo/data/lm/english_sample.txt
   2136 10152 56796 demo/data/lm/english_sample.txt

This text is extracted from the play *Romeo and Juliet* by William Shakespeare, taken from the Gutenberg Corpus (https://www.gutenberg.org/cache/epub/1513/pg1513.txt).

We will train a WordPiece tokenizer with a vocabulary size of around ``500``. The small vocabulary size is chosen because the corpus itself is small.

.. code-block:: console
   :linenos:

   $ python src/tokenizer/train_tokenizer.py \
       --input_file demo/data/lm/english_sample.txt \
       --name demo/model/tokenizer/ \
       --tokenizer_type wordpiece \
       --vocab_size 500
   [00:00:00] Pre-processing files (0 Mo) ████████████████████ 100%
   [00:00:00] Tokenize words              ████████████████████ 4252 / 4252
   [00:00:00] Count pairs                 ████████████████████ 4252 / 4252
   [00:00:00] Compute merges              ████████████████████ 387 / 387

The following file will be created inside the ``demo/model/tokenizer/`` folder:

.. code-block:: console
   :linenos:

   $ ls demo/model/tokenizer/
   tokenizer.json

Creating the Model Configuration File
=====================================

By default, the ``train_tokenizer.py`` script does not create the model configuration file. This file is required to load the tokenizer with ``AutoTokenizer.from_pretrained()``. We now use the ``create_config.py`` script to create it.

.. code-block:: console
   :linenos:

   $ python create_config.py \
       --path demo/model/tokenizer/ \
       --type gpt2
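
Once the configuration file has been created, the tokenizer can be loaded like any pretrained Hugging Face tokenizer. The following is a minimal sketch, assuming the ``transformers`` library is installed and that ``create_config.py`` has written a valid configuration into ``demo/model/tokenizer/``; the sample sentence is only illustrative.

.. code-block:: python
   :linenos:

   # A minimal sketch: load the newly trained tokenizer and encode a sentence.
   # Assumes `transformers` is installed and that create_config.py has written
   # the configuration files into demo/model/tokenizer/.
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("demo/model/tokenizer/")

   text = "Romeo and Juliet, by William Shakespeare"
   tokens = tokenizer.tokenize(text)  # subword pieces from the WordPiece model
   ids = tokenizer.encode(text)       # the corresponding token ids

   print(tokens)
   print(ids)

Note that with a vocabulary of only ``500`` tokens, most words will be split into several subword pieces; a larger corpus and vocabulary would yield longer, more word-like tokens.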