run_mlm module
Fine-tuning the library models for masked language modeling (BERT, ALBERT, RoBERTa…) on a text file or a dataset.
Here is the full list of checkpoints on the hub that can be fine-tuned by this script: https://huggingface.co/models?filter=fill-mask
- class run_mlm.DataTrainingArguments(dataset_name: Optional[str] = None, dataset_config_name: Optional[str] = None, train_file: Optional[str] = None, validation_file: Optional[str] = None, test_file: Optional[str] = None, max_train_samples: Optional[int] = None, max_eval_samples: Optional[int] = None, max_seq_length: Optional[int] = 512, line_by_line: bool = False, pad_to_max_length: bool = False, overwrite_cache: bool = False, validation_split_percentage: Optional[int] = 5, preprocessing_num_workers: Optional[int] = None, keep_linebreaks: bool = False, mlm_probability: float = 0.15)[source]
Bases: object
Arguments pertaining to what data we are going to input our model for training and eval.
- dataset_config_name: Optional[str] = None
- dataset_name: Optional[str] = None
- keep_linebreaks: bool = False
- line_by_line: bool = False
- max_eval_samples: Optional[int] = None
- max_seq_length: Optional[int] = 512
- max_train_samples: Optional[int] = None
- mlm_probability: float = 0.15
- overwrite_cache: bool = False
- pad_to_max_length: bool = False
- preprocessing_num_workers: Optional[int] = None
- test_file: Optional[str] = None
- train_file: Optional[str] = None
- validation_file: Optional[str] = None
- validation_split_percentage: Optional[int] = 5
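Since this is a plain dataclass, its fields can be instantiated and overridden directly in Python instead of via the command line. A minimal sketch, using an illustrative stand-in that mirrors a subset of the fields above (not the actual import from run_mlm):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in mirroring a subset of run_mlm.DataTrainingArguments;
# the real class lives in run_mlm.py and has the full field list shown above.
@dataclass
class DataTrainingArguments:
    dataset_name: Optional[str] = None
    train_file: Optional[str] = None
    max_seq_length: Optional[int] = 512
    line_by_line: bool = False
    mlm_probability: float = 0.15
    validation_split_percentage: Optional[int] = 5

# Override only what differs from the defaults in the signature above.
args = DataTrainingArguments(dataset_name="wikitext", mlm_probability=0.2)
print(args.mlm_probability)   # 0.2
print(args.max_seq_length)    # 512 (default kept)
```

Any field left unspecified keeps the default shown in the class signature, which is why the script can run with only a `dataset_name` or a `train_file` provided.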
- class run_mlm.ModelArguments(model_name_or_path: Optional[str] = None, model_type: Optional[str] = None, config_overrides: Optional[str] = None, config_name: Optional[str] = None, tokenizer_name: Optional[str] = None, cache_dir: Optional[str] = None, use_fast_tokenizer: bool = True, model_revision: str = 'main', use_auth_token: bool = False, freeze_token_embed: bool = False, pretrained_token_embed: Optional[str] = None)[source]
Bases: object
Arguments pertaining to which model/config/tokenizer we are going to fine-tune, or train from scratch.
- cache_dir: Optional[str] = None
- config_name: Optional[str] = None
- config_overrides: Optional[str] = None
- freeze_token_embed: bool = False
- model_name_or_path: Optional[str] = None
- model_revision: str = 'main'
- model_type: Optional[str] = None
- pretrained_token_embed: Optional[str] = None
- tokenizer_name: Optional[str] = None
- use_auth_token: bool = False
- use_fast_tokenizer: bool = True
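The `config_overrides` field in the upstream HuggingFace example scripts takes a comma-separated key=value string (assumed here to apply to this fork as well, e.g. "n_embd=10,resid_pdrop=0.2"). A hedged sketch of how such a string could be split into typed values; `parse_config_overrides` is a hypothetical helper, not part of run_mlm:

```python
import ast

def parse_config_overrides(config_overrides: str) -> dict:
    """Parse a comma-separated key=value string such as
    'n_embd=10,resid_pdrop=0.2' into a dict of typed values.
    The string format is assumed from the upstream HF example script."""
    overrides = {}
    for pair in config_overrides.split(","):
        key, _, value = pair.partition("=")
        try:
            # literal_eval turns "10" into int and "0.2" into float
            overrides[key.strip()] = ast.literal_eval(value.strip())
        except (ValueError, SyntaxError):
            overrides[key.strip()] = value.strip()  # keep as raw string
    return overrides

print(parse_config_overrides("n_embd=10,resid_pdrop=0.2"))
# {'n_embd': 10, 'resid_pdrop': 0.2}
```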
- run_mlm.read_txt_embeddings(file_name)[source]
Load pre-trained word embeddings from a text file in GloVe format.
- Parameters
file_name (str) – Path to the pre-trained word embedding file in GloVe format
- Returns
wordEmbedding (numpy ndarray) – Numpy array of embeddings, one row per word
dictionary (dict) – Dictionary mapping each word to its row index
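GloVe-format text files store one word per line followed by its vector components, separated by spaces. A minimal sketch of a reader matching the documented return values (an illustrative reimplementation under that format assumption, not the function's actual source):

```python
import numpy as np

def read_txt_embeddings(file_name):
    """Load word embeddings from a GloVe-format text file,
    where each line is '<word> <v1> <v2> ... <vN>'."""
    dictionary = {}
    vectors = []
    with open(file_name, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            dictionary[parts[0]] = len(dictionary)  # word -> row index
            vectors.append([float(x) for x in parts[1:]])
    word_embedding = np.asarray(vectors, dtype=np.float32)
    return word_embedding, dictionary
```

For a file containing two words with 2-dimensional vectors, the returned array has shape (2, 2), and the dictionary maps each word to its row in that array.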