NL-FM-Toolkit
tokenize_corpus module
Code to tokenize an in-house corpus (or several corpora) using a pre-trained tokenizer.
tokenize_corpus.get_command_line_args()
tokenize_corpus.main()
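The page lists only the two entry points, so the following is a minimal sketch of how a script with this shape might be organised, not the toolkit's actual implementation. The flags `--input_file`, `--output_file` and `--tokenizer_name`, the helper `tokenize_lines`, and the whitespace stand-in tokenizer are all assumptions made for illustration. In practice the `tokenize` callable would come from a pre-trained tokenizer, for example the `tokenize` method of a Hugging Face `AutoTokenizer` loaded with `from_pretrained`.

```python
import argparse
from typing import Callable, Iterable, List, Optional


def get_command_line_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse command-line options.

    All flag names here are hypothetical; the real script may use others.
    """
    parser = argparse.ArgumentParser(description="Tokenize a text corpus")
    parser.add_argument("--input_file", required=True, help="raw corpus, one sentence per line")
    parser.add_argument("--output_file", required=True, help="where to write tokenized text")
    parser.add_argument("--tokenizer_name", default="whitespace", help="tokenizer to load")
    return parser.parse_args(argv)


def tokenize_lines(lines: Iterable[str], tokenize: Callable[[str], List[str]]) -> List[str]:
    """Tokenize each non-empty line and join its tokens with single spaces."""
    tokenized = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            tokenized.append(" ".join(tokenize(line)))
    return tokenized


def main(argv: Optional[List[str]] = None) -> None:
    args = get_command_line_args(argv)
    # Stand-in tokenizer (assumption): plain whitespace splitting.
    # A pre-trained tokenizer's tokenize function would be substituted here.
    tokenize = str.split
    with open(args.input_file, encoding="utf-8") as infile:
        tokenized = tokenize_lines(infile, tokenize)
    with open(args.output_file, "w", encoding="utf-8") as outfile:
        outfile.write("\n".join(tokenized) + "\n")
```

Under these assumptions, calling `main(["--input_file", "corpus.txt", "--output_file", "corpus.tok.txt"])` would read `corpus.txt` and write one space-joined token sequence per line to `corpus.tok.txt`. Keeping the tokenizer as a plain callable means the file-handling logic can be exercised without downloading a model.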