tokenize_corpus module
Code to tokenize an in-house corpus or corpora using a pre-trained tokenizer
Tokenize a corpus using a pre-trained tokenizer
usage: tokenize_corpus.py [-h] [--input_file INPUT_FILE] [--model_path MODEL_PATH] [--output OUTPUT]
Named Arguments
- --input_file
path to corpus/corpora
- --model_path
path where the trained tokenizer is saved
Default: “models/model_path”
- --output
output file path
Default: “temp”
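The arguments listed above could be declared with argparse roughly as follows. This is only a sketch reconstructed from the usage line and defaults documented here, not the module's actual source: the function name build_parser and the sample file name corpus.txt are hypothetical, and the real script may declare types or extra options that this page does not show.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="tokenize_corpus.py",
        description="Tokenize corpus using pre-trained tokenizer",
    )
    # No default is documented for --input_file, so it falls back to None.
    parser.add_argument("--input_file", help="path to corpus/corpora")
    parser.add_argument(
        "--model_path",
        default="models/model_path",
        help="path where the trained tokenizer is saved",
    )
    parser.add_argument("--output", default="temp", help="output file path")
    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
# "corpus.txt" is a made-up file name.
args = build_parser().parse_args(["--input_file", "corpus.txt"])
print(args.input_file, args.model_path, args.output)
```

Because --model_path and --output carry defaults, only --input_file has to be supplied in a typical run; omitting it leaves the value unset, so a real script would presumably need to check for that before reading the corpus.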