run_clm module

Fine-tunes the library's models for causal language modeling (GPT, GPT-2, CTRL, …) on a text file or a dataset. The full list of checkpoints on the Hub that can be fine-tuned with this script is available at https://huggingface.co/models?filter=causal-lm
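A typical invocation, following the pattern documented in the Transformers examples README (the dataset, batch sizes, and output path here are illustrative placeholders):

```shell
python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm
```

Training on your own data instead of a Hub dataset works the same way, with `--train_file` and `--validation_file` in place of `--dataset_name`.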

class run_clm.DataTrainingArguments(dataset_name: Optional[str] = None, dataset_config_name: Optional[str] = None, train_file: Optional[str] = None, validation_file: Optional[str] = None, test_file: Optional[str] = None, max_train_samples: Optional[int] = None, max_eval_samples: Optional[int] = None, max_seq_length: Optional[int] = 1024, line_by_line: bool = False, pad_to_max_length: bool = False, block_size: Optional[int] = None, overwrite_cache: bool = False, validation_split_percentage: Optional[int] = 5, preprocessing_num_workers: Optional[int] = None, keep_linebreaks: bool = True)[source]

Bases: object

Arguments pertaining to what data we are going to input to our model for training and evaluation.

block_size: Optional[int] = None
dataset_config_name: Optional[str] = None
dataset_name: Optional[str] = None
keep_linebreaks: bool = True
line_by_line: bool = False
max_eval_samples: Optional[int] = None
max_seq_length: Optional[int] = 1024
max_train_samples: Optional[int] = None
overwrite_cache: bool = False
pad_to_max_length: bool = False
preprocessing_num_workers: Optional[int] = None
test_file: Optional[str] = None
train_file: Optional[str] = None
validation_file: Optional[str] = None
validation_split_percentage: Optional[int] = 5
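The script parses these fields from the command line: each dataclass attribute becomes a `--flag` of the same name. The real script uses `transformers.HfArgumentParser`; the stdlib-only `parse_dataclass` helper below is a hypothetical stand-in that sketches the same mapping for a subset of the fields documented above:

```python
import argparse
import dataclasses
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataTrainingArguments:
    # Subset of the fields documented above.
    dataset_name: Optional[str] = None
    block_size: Optional[int] = None
    validation_split_percentage: Optional[int] = 5
    overwrite_cache: bool = False

def parse_dataclass(cls, argv):
    """Turn each dataclass field into a --flag, as HfArgumentParser does."""
    parser = argparse.ArgumentParser()
    for f in dataclasses.fields(cls):
        if f.type is bool:
            # Boolean fields become presence flags.
            parser.add_argument("--" + f.name, action="store_true",
                                default=f.default)
        else:
            parser.add_argument("--" + f.name, default=f.default)
    ns = parser.parse_args(argv)
    return cls(**vars(ns))

args = parse_dataclass(
    DataTrainingArguments,
    ["--dataset_name", "wikitext", "--overwrite_cache"],
)
print(args.dataset_name, args.overwrite_cache, args.validation_split_percentage)
# wikitext True 5
```

Fields left unset keep their dataclass defaults, which is why `validation_split_percentage` comes back as 5 here.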
class run_clm.ModelArguments(model_name_or_path: Optional[str] = None, model_type: Optional[str] = None, config_overrides: Optional[str] = None, config_name: Optional[str] = None, tokenizer_name: Optional[str] = None, cache_dir: Optional[str] = None, use_fast_tokenizer: bool = True, model_revision: str = 'main', use_auth_token: bool = False)[source]

Bases: object

Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.

cache_dir: Optional[str] = None
config_name: Optional[str] = None
config_overrides: Optional[str] = None
model_name_or_path: Optional[str] = None
model_revision: str = 'main'
model_type: Optional[str] = None
tokenizer_name: Optional[str] = None
use_auth_token: bool = False
use_fast_tokenizer: bool = True
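`config_overrides` is a comma-separated `key=value` string that the script forwards to `PretrainedConfig.update_from_string` to override individual config settings (useful when training from scratch). A hedged stdlib-only sketch of that kind of parsing (the helper name is hypothetical):

```python
import ast

def parse_config_overrides(overrides: str) -> dict:
    """Parse a string like 'n_embd=10,resid_pdrop=0.2' into typed values,
    mimicking how update_from_string coerces numbers and booleans."""
    result = {}
    for pair in overrides.split(","):
        key, _, raw = pair.partition("=")
        try:
            # Interpret ints/floats; fall back for bools and plain strings.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = {"true": True, "false": False}.get(raw.lower(), raw)
        result[key.strip()] = value
    return result

print(parse_config_overrides("n_embd=10,resid_pdrop=0.2"))
# {'n_embd': 10, 'resid_pdrop': 0.2}
```

Note that `config_overrides` cannot be combined with `--config_name` or `--model_name_or_path` pointing at a pretrained model; it is meant for adjusting a fresh config.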
run_clm.main()[source]

Entry point for the script: parses ModelArguments, DataTrainingArguments, and the standard TrainingArguments from the command line, then loads the dataset and model, groups the texts into blocks, and runs training and evaluation with Trainer.