train_tokenizer module

Code to train a tokenizer on an in-house corpus or corpora

This code takes a corpus or corpora as input and trains a sub-word tokenizer using the Hugging Face Transformers library. Optionally, the code takes a vocabulary file containing one word per line; these words will not be split by the tokenizer.
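For illustration, the following is a minimal sketch of what such a training run might look like with the Hugging Face tokenizers package; the class choice and paths are assumptions (they mirror the documented defaults), and the actual implementation in train_tokenizer.py may differ.

import os
from tokenizers import ByteLevelBPETokenizer

# Illustrative sketch only: the real script chooses a byte-level or wordpiece
# tokenizer based on --tokenizer_type; paths below mirror the documented defaults.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/input.txt"],   # corpus to train on (--input_file default)
    vocab_size=30000,           # --vocab_size default
)
os.makedirs("models/byte_tokenizer", exist_ok=True)
tokenizer.save_model("models/byte_tokenizer")  # --name default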

train_tokenizer.add_vocab_from_file(tokenizer, vocab_file)[source]

This function reads the vocabulary from the file and adds the words to the trained tokenizer. The vocabulary file should contain one word per line. The tokenizer will not split these words. A sketch of a possible implementation appears after the parameter list below.

Parameters
  • tokenizer (AutoTokenizer) – the tokenizer that was just trained; any pre-trained tokenizer may also be passed

  • vocab_file (str) – vocabulary file containing one word per line; these words will not be split into subwords

Returns

None
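As a rough illustration, one way such a function could be implemented is on top of AutoTokenizer.add_tokens; the body below is an assumption and may not match the actual source.

def add_vocab_from_file(tokenizer, vocab_file):
    # Hypothetical sketch: read one word per line and register each word as an
    # added token so the tokenizer keeps it whole instead of splitting it.
    with open(vocab_file, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    tokenizer.add_tokens(words)

If the resulting tokenizer is later paired with a model, the model's embedding matrix usually has to be resized to account for the added tokens, e.g. model.resize_token_embeddings(len(tokenizer)).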

train_tokenizer.main(args)[source]

Train a tokenizer from scratch. An example invocation is shown after the argument list below.

usage: train_tokenizer.py [-h] [--input_file INPUT_FILE] [--name NAME] [--tokenizer_type {byte,wordpiece}] [--vocab_file VOCAB_FILE] [--vocab_size VOCAB_SIZE]

Named Arguments

--input_file

path to the corpus/corpora on which the tokenizer is to be trained

Default: “data/input.txt”

--name

path where the trained tokenizer will be saved

Default: “models/byte_tokenizer”

--tokenizer_type

Possible choices: byte, wordpiece

type of tokenizer to be trained

Default: “byte”

--vocab_file

vocabulary file containing one word per line; these words will not be split into subwords

--vocab_size

vocabulary size of the trained tokenizer

Default: 30000
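For example, a hypothetical invocation using the documented flags (file paths are illustrative, not part of the repository) might look like:

python train_tokenizer.py \
    --input_file data/input.txt \
    --name models/wordpiece_tokenizer \
    --tokenizer_type wordpiece \
    --vocab_file data/protected_words.txt \
    --vocab_size 30000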