train_tokenizer module
Code to train a tokenizer on an in-house corpus or corpora
This code takes a corpus or corpora as input and trains a sub-word tokenizer using the Hugging Face Transformers library. Optionally, it accepts a vocabulary file containing one word per line; these words will not be split by the tokenizer.
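As a rough illustration, training a byte-level BPE tokenizer from scratch might look like the sketch below. It uses the standalone Hugging Face tokenizers package (an assumption; the module itself is described in terms of the Transformers library), and the file paths mirror the script's defaults.

    import os
    from tokenizers import ByteLevelBPETokenizer

    # Illustrative paths matching the script's defaults
    corpus_file = "data/input.txt"
    save_dir = "models/byte_tokenizer"
    os.makedirs(save_dir, exist_ok=True)

    # Train a byte-level BPE tokenizer on the corpus
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=[corpus_file],
        vocab_size=30000,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )

    # Writes vocab.json and merges.txt into the output directory
    tokenizer.save_model(save_dir)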
- train_tokenizer.add_vocab_from_file(tokenizer, vocab_file)[source]
This function reads vocabulary from the file and adds the words to the trained tokenizer. The vocabulary file should contain one word per line. The tokenizer will not split these words (a possible implementation is sketched after the parameter list).
- Parameters
tokenizer (AutoTokenizer) – the tokenizer that was just trained; it can also be any pre-trained tokenizer
vocab_file (str) – vocabulary file containing one word per line; these words will not be split into subwords
- Returns
None
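A minimal sketch of what this function might look like, assuming the tokenizer exposes the standard add_tokens method of Transformers tokenizers; the module's actual implementation may differ.

    def add_vocab_from_file(tokenizer, vocab_file):
        """Add whole words from vocab_file so the tokenizer never splits them."""
        with open(vocab_file, encoding="utf-8") as f:
            # One word per line; ignore blank lines
            words = [line.strip() for line in f if line.strip()]
        # add_tokens registers each word as a single, unsplittable token
        tokenizer.add_tokens(words)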
Train a tokenizer from scratch
usage: train_tokenizer.py [-h] [--input_file INPUT_FILE] [--name NAME] [--tokenizer_type {byte,wordpiece}] [--vocab_file VOCAB_FILE] [--vocab_size VOCAB_SIZE]
Named Arguments
- --input_file
path to the corpus or corpora on which the tokenizer will be trained
Default: “data/input.txt”
- --name
path where the trained tokenizer will be saved
Default: “models/byte_tokenizer”
- --tokenizer_type
Possible choices: byte, wordpiece
type of tokenizer to be trained
Default: “byte”
- --vocab_file
vocabulary file containing one word per line; these words will not be split into subwords
- --vocab_size
vocabulary size of the trained tokenizer
Default: 30000
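For example, training a byte-level tokenizer with a protected-vocabulary file might be invoked as follows (file paths are illustrative):

    python train_tokenizer.py \
        --input_file data/corpus.txt \
        --name models/byte_tokenizer \
        --tokenizer_type byte \
        --vocab_file data/protected_words.txt \
        --vocab_size 30000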