train_tokenizer module

Code to train a tokenizer on an in-house corpus or corpora

This code takes one or more corpora as input and trains a sub-word tokenizer using the Hugging Face Transformers library. Optionally, it accepts a vocabulary file containing one word per line; these words will not be split by the tokenizer.
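The training step described above can be sketched with the Hugging Face `tokenizers` library, which Transformers builds on. This is a minimal illustration, not the module's actual code: the function name `train_subword_tokenizer`, the choice of BPE, and the special tokens are assumptions.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers


def train_subword_tokenizer(corpus_files, vocab_size=8000):
    """Train a BPE sub-word tokenizer on the given corpus files.

    A sketch of the flow described in the module docstring; the real
    module may use a different model type or vocabulary size.
    """
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    # Split raw text on whitespace/punctuation before learning merges.
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[PAD]"],
    )
    tokenizer.train(corpus_files, trainer)
    return tokenizer
```

The trained tokenizer can then be saved or wrapped for use with Transformers models.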

train_tokenizer.add_vocab_from_file(tokenizer, vocab_file)[source]

This function reads vocabulary from the file and adds the words to the trained tokenizer. The vocabulary file should contain one word per line. The tokenizer will not split these words.

Parameters
  • tokenizer (AutoTokenizer) – the tokenizer just trained, or any pre-trained tokenizer

  • vocab_file (str) – path to a vocabulary file containing one word per line; these words will not be split into subwords

Returns

None
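A plausible implementation sketch of this function, assuming only that the tokenizer exposes the standard Hugging Face `add_tokens()` method (which registers tokens so they are never split). This is an illustration, not the module's actual code:

```python
def add_vocab_from_file(tokenizer, vocab_file):
    """Add every non-empty line of vocab_file to the tokenizer as an
    atomic token, so the tokenizer never splits these words."""
    with open(vocab_file, encoding="utf-8") as f:
        # One word per line; skip blank lines and surrounding whitespace.
        words = [line.strip() for line in f if line.strip()]
    # Hugging Face tokenizers expose add_tokens(); already-known
    # tokens are skipped automatically.
    tokenizer.add_tokens(words)
```

Note that when tokens are added to a trained tokenizer this way, any model using it must have its embedding matrix resized accordingly (`model.resize_token_embeddings(len(tokenizer))` in Transformers).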

train_tokenizer.get_command_line_args()[source]
train_tokenizer.main(args)[source]
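The command-line entry points are undocumented here; a sketch of what `get_command_line_args` might look like with `argparse`, based on the inputs the module description mentions (corpora and an optional vocabulary file). All flag names and defaults below are hypothetical:

```python
import argparse


def get_command_line_args(argv=None):
    """Parse CLI options. Flag names are illustrative guesses, not
    the module's actual interface."""
    parser = argparse.ArgumentParser(
        description="Train a sub-word tokenizer on in-house corpora."
    )
    parser.add_argument("--corpus", nargs="+", required=True,
                        help="one or more corpus text files")
    parser.add_argument("--vocab-file", default=None,
                        help="optional file with one protected word per line")
    parser.add_argument("--vocab-size", type=int, default=30000,
                        help="target vocabulary size for training")
    return parser.parse_args(argv)
```

`main(args)` would then train the tokenizer on `args.corpus` and, if `args.vocab_file` is given, call `add_vocab_from_file` on the result.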