Training a Causal Language Model from Scratch
We are now ready to train our own language model from scratch. We run the scripts/run_clm.sh script to train the model; it takes the train file, validation file, output directory, tokenizer directory, model type, and number of preprocessing workers as positional arguments ($1 through $6).
TRANSFORMERS_CACHE=/tmp/ PYTORCH_TRANSFORMERS_CACHE=/tmp/ PYTHONIOENCODING=utf-8 python src/lm/run_clm.py \
--model_type $5 \
--tokenizer_name $4 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--train_file $1 \
--validation_file $2 \
--remove_unused_columns False \
--preprocessing_num_workers $6 \
--pad_to_max_length \
--line_by_line \
--do_train \
--do_eval \
--max_seq_length 512 \
--num_train_epochs 1 \
--overwrite_output_dir \
--output_dir $3 \
--report_to none \
--cache_dir /tmp/ \
--evaluation_strategy steps \
--logging_steps 10000 \
--save_steps 10000 \
--save_total_limit 2
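The --line_by_line and --pad_to_max_length flags control how the raw text is turned into training examples: each line of the input file becomes one example, truncated or padded to --max_seq_length tokens. The following is a simplified sketch of that preprocessing step (not the exact code in run_clm.py), assuming a tokenizer saved at demo/model/tokenizer/, the path used in the test run later in this section.

# Simplified sketch of line-by-line preprocessing; not the exact run_clm.py code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("demo/model/tokenizer/")
if tokenizer.pad_token is None:
    # GPT-2-style tokenizers ship without a pad token; reuse the end-of-text token
    tokenizer.pad_token = tokenizer.eos_token

with open("demo/data/lm/english_sample.txt", encoding="utf-8") as f:
    lines = [line for line in f if line.strip()]  # --line_by_line: one example per non-empty line

encodings = tokenizer(
    lines,
    padding="max_length",  # --pad_to_max_length
    truncation=True,
    max_length=512,        # --max_seq_length 512
    return_tensors="pt",
)
print(encodings["input_ids"].shape)  # (number of lines, 512)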
However, for testing we override the parameters and write them to a new script file, scripts/run_clm_test.sh. The --config_overrides argument below reduces the model size so that it can be trained on a CPU-only system, and --max_train_samples/--max_eval_samples cap the run at 100 examples each.
TRANSFORMERS_CACHE=/tmp/ PYTORCH_TRANSFORMERS_CACHE=/tmp/ PYTHONIOENCODING=utf-8 python src/lm/run_clm.py \
--model_type $5 \
--tokenizer_name $4 \
--config_overrides="n_embd=128,n_head=4,n_layer=2,n_positions=256" \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--train_file $1 \
--validation_file $2 \
--remove_unused_columns False \
--preprocessing_num_workers $6 \
--pad_to_max_length \
--max_train_samples 100 \
--max_eval_samples 100 \
--line_by_line \
--do_train \
--do_eval \
--max_seq_length 256 \
--num_train_epochs 1 \
--overwrite_output_dir \
--output_dir $3 \
--report_to none \
--cache_dir /tmp/ \
--evaluation_strategy steps \
--logging_steps 10 \
--save_steps 10 \
--save_total_limit 2
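To see how much --config_overrides shrinks the model, we can instantiate the same reduced GPT-2 configuration and count its parameters. The sketch below leaves the vocabulary size at GPT-2's default of 50,257, which is an assumption; run_clm.py uses the vocabulary of the tokenizer passed on the command line, so the exact count differs, but it lands in the same single-digit-million range as the 6.92M reported in the training snapshot that follows.

# Sketch: build the reduced GPT-2 config from --config_overrides and count parameters.
# vocab_size is left at GPT-2's default (50257) here, which is an assumption;
# the real run uses the vocabulary of the tokenizer in demo/model/tokenizer/.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_embd=128, n_head=4, n_layer=2, n_positions=256)
model = GPT2LMHeadModel(config)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.2f}M parameters")  # a few million, dominated by the token embedding matrix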
We now train the CLM with the test script file; a snapshot of the training process is shown below.
$ sh scripts/run_clm_test.sh demo/data/lm/english_sample.txt demo/data/lm/english_sample.txt demo/model/clm/ demo/model/tokenizer/ gpt2 16

04/07/2022 21:16:29 - WARNING - __main__ - You are instantiating a new config instance from scratch.
04/07/2022 21:16:29 - INFO - __main__ - Overriding config: n_embd=128,n_head=4,n_layer=4,n_positions=256

[INFO|tokenization_utils_base.py:1671] 2022-04-07 21:16:29,818 >> Didn't find file demo/model/tokenizer/vocab.json. We won't load it.
[INFO|tokenization_utils_base.py:1671] 2022-04-07 21:16:29,818 >> Didn't find file demo/model/tokenizer/merges.txt. We won't load it.
[INFO|tokenization_utils_base.py:1740] 2022-04-07 21:16:29,818 >> loading file None
[INFO|tokenization_utils_base.py:1740] 2022-04-07 21:16:29,818 >> loading file None
[INFO|tokenization_utils_base.py:1740] 2022-04-07 21:16:29,818 >> loading file demo/model/tokenizer/tokenizer.json
[INFO|tokenization_utils_base.py:1740] 2022-04-07 21:16:29,818 >> loading file demo/model/tokenizer/added_tokens.json
[INFO|tokenization_utils_base.py:1740] 2022-04-07 21:16:29,818 >> loading file demo/model/tokenizer/special_tokens_map.json
[INFO|tokenization_utils_base.py:1740] 2022-04-07 21:16:29,818 >> loading file demo/model/tokenizer/tokenizer_config.json

04/07/2022 21:16:30 - INFO - __main__ - Training new model from scratch - Total size=6.92M params

[INFO|trainer.py:1204] 2022-04-07 20:12:42,760 >> ***** Running training *****
[INFO|trainer.py:1205] 2022-04-07 20:12:42,760 >> Num examples = 1895
[INFO|trainer.py:1206] 2022-04-07 20:12:42,760 >> Num Epochs = 1
[INFO|trainer.py:1207] 2022-04-07 20:12:42,760 >> Instantaneous batch size per device = 8
[INFO|trainer.py:1208] 2022-04-07 20:12:42,760 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1209] 2022-04-07 20:12:42,760 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1210] 2022-04-07 20:12:42,760 >> Total optimization steps = 237

{'loss': 5.9329, 'learning_rate': 2.8902953586497894e-05, 'epoch': 0.42}
{'eval_loss': 5.720452785491943, 'eval_runtime': 30.5425, 'eval_samples_per_second': 62.045, 'eval_steps_per_second': 7.76, 'epoch': 0.42}

{'loss': 5.6865, 'learning_rate': 7.805907172995782e-06, 'epoch': 0.84}
{'eval_loss': 5.609338760375977, 'eval_runtime': 30.8089, 'eval_samples_per_second': 61.508, 'eval_steps_per_second': 7.693, 'epoch': 0.84}

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 220.6908, 'train_samples_per_second': 8.587, 'train_steps_per_second': 1.074, 'train_loss': 5.776248851405921, 'epoch': 1.0}

***** eval metrics *****
epoch = 1.0
eval_loss = 5.6093
eval_runtime = 0:00:36.93
eval_samples = 1895
eval_samples_per_second = 51.301
eval_steps_per_second = 6.416
perplexity = 272.9637

[INFO|modelcard.py:456] 2022-04-07 21:28:38,572 >> Dropping the following result as it does not have all the necessary fields:

{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
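The perplexity reported in the eval metrics is simply the exponential of the evaluation loss (the average per-token cross-entropy), which we can verify from the numbers in the snapshot:

import math

eval_loss = 5.609338760375977  # eval_loss from the snapshot above
print(math.exp(eval_loss))     # about 272.96, matching the reported perplexity of 272.9637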
The trained model is saved in the following folder and is ready to be fine-tuned:
$ ls demo/model/clm/
README.md
added_tokens.json
all_results.json
checkpoint-100/
checkpoint-200/
config.json
eval_results.json
pytorch_model.bin
special_tokens_map.json
tokenizer_config.json
tokenizer.json
trainer_state.json
train_results.json
training_args.bin
vocab.txt
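As a quick sanity check before fine-tuning, the checkpoint can be loaded back with the standard transformers auto classes and used to generate text. The sketch below uses the demo output directory from the listing above and an arbitrary prompt; with such a tiny model trained for one epoch the continuation will be mostly gibberish, but it confirms that the saved model loads and runs.

# Minimal sketch: reload the freshly trained CLM and sample a continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("demo/model/clm/")
model = AutoModelForCausalLM.from_pretrained("demo/model/clm/")

inputs = tokenizer("The history of natural language processing", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))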