Training a Masked Language Model from Scratch

We are now ready to train our own language model from scratch.

We run the scripts/run_mlm.sh script to train the model. It wraps src/lm/run_mlm.py and takes the training file, validation file, output directory, tokenizer directory, model type, and number of preprocessing workers as positional arguments.

# Usage: run_mlm.sh <train_file> <validation_file> <output_dir> <tokenizer_dir> <model_type> <num_workers>
TRANSFORMERS_CACHE=/tmp/ PYTORCH_TRANSFORMERS_CACHE=/tmp/ PYTHONIOENCODING=utf-8 python src/lm/run_mlm.py \
    --model_type $5 \
    --tokenizer_name $4 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --train_file $1 \
    --validation_file $2 \
    --remove_unused_columns False \
    --preprocessing_num_workers $6 \
    --pad_to_max_length \
    --line_by_line \
    --do_train \
    --do_eval \
    --num_train_epochs 1 \
    --overwrite_output_dir \
    --output_dir $3 \
    --report_to none \
    --cache_dir /tmp/ \
    --evaluation_strategy steps \
    --logging_steps 10000 \
    --save_steps 10000 \
    --save_total_limit 2

For testing, however, we override some parameters and write them to a new script file, scripts/run_mlm_test.sh. The --config_overrides argument below shrinks the model so that it can be trained on a CPU-only system; a short sketch after the script illustrates the effect on model size.

TRANSFORMERS_CACHE=/tmp/ PYTORCH_TRANSFORMERS_CACHE=/tmp/ PYTHONIOENCODING=utf-8 python src/lm/run_mlm.py \
    --model_type $5 \
    --tokenizer_name $4 \
    --config_overrides="hidden_size=128,intermediate_size=512,num_attention_heads=4,num_hidden_layers=2,max_position_embeddings=512" \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --train_file $1 \
    --validation_file $2 \
    --remove_unused_columns False \
    --preprocessing_num_workers $6 \
    --pad_to_max_length \
    --max_train_samples 100 \
    --max_eval_samples 100 \
    --line_by_line \
    --do_train \
    --do_eval \
    --num_train_epochs 1 \
    --overwrite_output_dir \
    --output_dir $3 \
    --report_to none \
    --cache_dir /tmp/ \
    --evaluation_strategy steps \
    --logging_steps 10 \
    --save_steps 10 \
    --save_total_limit 2
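
To make the effect of these overrides concrete, here is a minimal sketch, assuming the same transformers library that run_mlm.py builds on, that counts the parameters of a default BERT configuration and of the reduced configuration used above.

from transformers import BertConfig, BertForMaskedLM

# Default BERT-base sized configuration (12 hidden layers, hidden size 768).
base = BertForMaskedLM(BertConfig())

# Reduced configuration matching the --config_overrides in scripts/run_mlm_test.sh.
small = BertForMaskedLM(BertConfig(
    hidden_size=128,
    intermediate_size=512,
    num_attention_heads=4,
    num_hidden_layers=2,
    max_position_embeddings=512,
))

print(f"default : {base.num_parameters() / 1e6:.1f}M parameters")
print(f"reduced : {small.num_parameters() / 1e6:.1f}M parameters")

The reduced model is only a few million parameters, most of them in the token embeddings, which is why a single epoch finishes in minutes on a CPU.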

We now train the MLM with the test script and show a snapshot of the training process:

$ sh scripts/run_mlm_test.sh demo/data/lm/english_sample.txt demo/data/lm/english_sample.txt demo/model/mlm/ demo/model/tokenizer/ bert 16

04/07/2022 20:12:41 - WARNING - __main__ - You are instantiating a new config instance from scratch.
04/07/2022 20:12:41 - WARNING - __main__ - Overriding config: hidden_size=128,intermediate_size=512,num_attention_heads=4,num_hidden_layers=4,max_position_embeddings=512
04/07/2022 20:12:41 - WARNING - __main__ - New config: BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 4,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.14.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

[INFO|tokenization_utils_base.py:1671] 2022-04-07 20:12:41,922 >> Didn't find file demo/model/tokenizer/vocab.json. We won't load it.
[INFO|tokenization_utils_base.py:1671] 2022-04-07 20:12:41,922 >> Didn't find file demo/model/tokenizer/merges.txt. We won't load it.
[INFO|tokenization_utils_base.py:1740] 2022-04-07 20:12:41,923 >> loading file None
[INFO|tokenization_utils_base.py:1740] 2022-04-07 20:12:41,923 >> loading file None
[INFO|tokenization_utils_base.py:1740] 2022-04-07 20:12:41,923 >> loading file demo/model/tokenizer/tokenizer.json
[INFO|tokenization_utils_base.py:1740] 2022-04-07 20:12:41,923 >> loading file demo/model/tokenizer/added_tokens.json
[INFO|tokenization_utils_base.py:1740] 2022-04-07 20:12:41,923 >> loading file demo/model/tokenizer/special_tokens_map.json
[INFO|tokenization_utils_base.py:1740] 2022-04-07 20:12:41,923 >> loading file demo/model/tokenizer/tokenizer_config.json

04/07/2022 20:12:42 - WARNING - __main__ - Total parameters in the model = 4.59M params
04/07/2022 20:12:42 - WARNING - __main__ - Training new model from scratch : Total size = 4.59M params

[INFO|trainer.py:1204] 2022-04-07 20:12:42,760 >> ***** Running training *****
[INFO|trainer.py:1205] 2022-04-07 20:12:42,760 >>   Num examples = 1895
[INFO|trainer.py:1206] 2022-04-07 20:12:42,760 >>   Num Epochs = 1
[INFO|trainer.py:1207] 2022-04-07 20:12:42,760 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1208] 2022-04-07 20:12:42,760 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1209] 2022-04-07 20:12:42,760 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1210] 2022-04-07 20:12:42,760 >>   Total optimization steps = 237

{'loss': 6.1333, 'learning_rate': 2.8902953586497894e-05, 'epoch': 0.42}
{'eval_loss': 6.023196220397949, 'eval_runtime': 132.1578, 'eval_samples_per_second': 14.339, 'eval_steps_per_second': 1.793, 'epoch': 0.42}
{'loss': 5.9755, 'learning_rate': 7.805907172995782e-06, 'epoch': 0.84}

Training completed. Do not forget to share your model on huggingface.co/models =)

{'eval_loss': 5.97206974029541, 'eval_runtime': 81.7657, 'eval_samples_per_second': 23.176, 'eval_steps_per_second': 2.899, 'epoch': 0.84}
{'train_runtime': 533.2352, 'train_samples_per_second': 3.554, 'train_steps_per_second': 0.444, 'train_loss': 6.034984540335739, 'epoch': 1.0}
***** train metrics *****
epoch                    =        1.0
train_loss               =      6.035
train_runtime            = 0:08:53.23
train_samples            =       1895
train_samples_per_second =      3.554
train_steps_per_second   =      0.444
04/07/2022 20:27:27 - WARNING - __main__ - *** Evaluate ***
{'task': {'name': 'Masked Language Modeling', 'type': 'fill-mask'}}
***** eval metrics *****
epoch                   =        1.0
eval_loss               =     5.9712
eval_runtime            = 0:01:24.94
eval_samples            =       1895
eval_samples_per_second =     22.308
eval_steps_per_second   =       2.79
perplexity              =   391.9806
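
The reported perplexity is simply the exponential of the evaluation loss; the quick check below (plain Python, no model involved) reproduces the number from the log.

import math

# run_mlm.py reports perplexity as exp(eval_loss).
eval_loss = 5.9712
print(round(math.exp(eval_loss), 2))  # ~391.98, matching the perplexity in the log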

The trained model is saved in the following folder and is ready for fine-tuning; a short loading sketch follows the listing.

$ ls demo/model/mlm/
README.md
all_results.json
added_tokens.json
checkpoint-100/
checkpoint-200/
config.json
eval_results.json
pytorch_model.bin
special_tokens_map.json
tokenizer_config.json
tokenizer.json
trainer_state.json
train_results.json
training_args.bin
vocab.txt
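
As a quick sanity check before fine-tuning, the checkpoint can be loaded back with the fill-mask pipeline. This is a minimal sketch, assuming the transformers library and that the tokenizer saved with the model uses BERT's [MASK] token; the example sentence is purely illustrative.

from transformers import pipeline

# Load the freshly trained checkpoint and its tokenizer from the output folder.
fill_mask = pipeline("fill-mask", model="demo/model/mlm/", tokenizer="demo/model/mlm/")

# Ask the model to fill in the masked position; predictions will be rough after a
# single short epoch, but the call confirms the checkpoint loads and runs cleanly.
for prediction in fill_mask("The weather today is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))

The same demo/model/mlm/ path can later be passed to from_pretrained with a task-specific head (for example, AutoModelForSequenceClassification) when fine-tuning on a downstream task.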