
FAQ

Why should I use a different base model than the vanilla pretrained model?

It improves results significantly in most cases, and on average. The top-ranked RoBERTa-base base models improve results in 75% of the tasks we evaluated, with a median gain of 2.5 accuracy points. So if you had to choose one base model, it would be best to use one of these top-ranked models.

Can I get worse results from training over the top ranked base model when compared to the vanilla model?

Yes. For example, with RoBERTa-base, about 1 in 4 tasks performs slightly better with the vanilla pretrained model. Furthermore, differences in seed randomization can produce variance in results. The best approach is to assess multiple models and evaluate on dev data, as in the sketch below.
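
To make this concrete, here is a minimal sketch (not the leaderboard's own evaluation harness) of comparing candidate base models against vanilla roberta-base on a dev set over several seeds. The second model id is a placeholder for whichever top-ranked checkpoint you pick from the leaderboard, and the dataset and hyperparameters are just examples.

```python
# Sketch: compare candidate base models on dev data across seeds.
# "<top-ranked-roberta-base-checkpoint>" is a placeholder, not a real model id.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CANDIDATES = ["roberta-base", "<top-ranked-roberta-base-checkpoint>"]
SEEDS = [0, 1, 2]

raw = load_dataset("rotten_tomatoes")  # any small classification task with a dev split
tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # all candidates share the RoBERTa tokenizer
tokenized = raw.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (logits.argmax(-1) == labels).mean()}

results = {}
for name in CANDIDATES:
    scores = []
    for seed in SEEDS:
        model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
        args = TrainingArguments(
            output_dir=f"out/{name.replace('/', '_')}/{seed}",
            seed=seed,
            num_train_epochs=3,
            learning_rate=5e-5,
            per_device_train_batch_size=32,
            report_to="none",
        )
        trainer = Trainer(model=model, args=args,
                          train_dataset=tokenized["train"],
                          eval_dataset=tokenized["validation"],
                          tokenizer=tokenizer,
                          compute_metrics=accuracy)
        trainer.train()
        scores.append(trainer.evaluate()["eval_accuracy"])
    results[name] = (np.mean(scores), np.std(scores))

print(results)  # pick the base model with the best mean dev accuracy
```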

Some base models were trained on tasks that are included in the evaluation task set, while other base models were not. Is this a fair comparison?

Yes. During evaluation, each base model is fine-tuned on the evaluation task's train set and evaluated on the evaluation task's test set. The fact that the base model was previously trained on the same train set does not give it access to any additional training data for the evaluation task. To convince yourself, consider the effect of running one more training epoch on a model that has already been fine-tuned and converged on a specific task: this additional epoch will have no effect.

That said, linear probing (training only the classification head) does give an advantage to models that were already fine-tuned on the evaluated dataset.
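
For clarity, here is a minimal sketch of what linear probing means in practice, assuming a HuggingFace RoBERTa classifier (the num_labels=3 value is just MNLI's label count):

```python
# Sketch: linear probing freezes the backbone and trains only the classification head,
# so a model already fine-tuned on the evaluated dataset keeps its advantage.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

# Freeze the RoBERTa encoder; leave only the classification head trainable.
for param in model.roberta.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only classifier.* parameters remain
```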

You should always review the base model's license and factsheet to ensure they meet the requirements of your particular use case, and only download models and datasets from sources that you trust: downloading models and datasets can run code on your machine (see, for example, the HuggingFace warning). We do not certify the quality or usability of the models listed.

Which architectures are supported?

We are gradually adding sequence-to-sequence and classification architectures; the full list is here. Other architectures will be added soon. Want us to add a specific model? Please contact us and let us know. If you also have recommended training parameters, even better: send them too.

Could you test my model?

Sure. If the architecture is not supported, see the question above. Otherwise, you can upload the model to HuggingFace and wait for the next leaderboard update.

How frequently do you update the leaderboard?

We will update the results monthly.

How do you assess the models?

We train a linear probing classification head on MNLI for each candidate model. We then take each of the top 5 ranked models and fine-tune it on the 36 classification tasks (consisting of sentiment, NLI, Twitter, topic classification, and other general classification tasks). We compare against the baseline of the vanilla model, which is also trained and assessed over 5 seeds. We use the following hyperparameters:

- model name: roberta-base
- tokenizer: roberta-base
- train size: inf
- val size: inf
- test size: inf
- epochs: 10
- learning rate: 5e-5, linear, 0.0006
- early stop epsilon: 0.001
- batch size: 256
- patience: 20 * 50 * 256
- validate every: 50 * 256
- seed: 0
- l2 reg: 0.0
- classification model:
- optimizer: adamw
- weight decay: 0.01
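
As an illustration only, the configuration above roughly corresponds to the following HuggingFace TrainingArguments sketch. The step-based interpretation of the patience and validation intervals (counted in examples at batch size 256) is our assumption, not the leaderboard's exact code:

```python
# Sketch: approximate mapping of the hyperparameters above to TrainingArguments.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=10,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    per_device_train_batch_size=256,
    weight_decay=0.01,
    optim="adamw_torch",
    seed=0,
    eval_strategy="steps",        # called evaluation_strategy in older transformers releases
    eval_steps=50,                # "validate every: 50 * 256" examples at batch size 256
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,  # required for early stopping
    metric_for_best_model="accuracy",
)
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=20,      # "patience: 20 * 50 * 256" examples, i.e. 20 evaluations
    early_stopping_threshold=0.001,  # "early stop epsilon"
)
# trainer = Trainer(model=..., args=args, callbacks=[early_stopping], ...)
```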

Which datasets are used?

We use the following datasets:

  1. Entailment: MNLI, ESNLI, QNLI, QQP, RTE, WNLI, ANLI
  2. Sentiment: SST-2, SST-5, Poem Sentiment, IMDB, Rotten Tomatoes, Amazon Reviews, Financial PhraseBank
  3. Topic classification: AG News, ISEAR, Yahoo Answers, DBpedia, 20 Newsgroups, TREC fine, TREC coarse
  4. Twitter: Tweet Emoji, Tweet Emotion, Tweet Hate, Tweet Irony, Tweet Offensive, Tweet Sentiment
  5. Others: CoLA, STS-B, QQP, QNLI, RTE, WNLI, MRPC, BoolQ, CB, COPA, WIC, WSC
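
For reference, most of these datasets can be pulled from the HuggingFace Hub. The ids below are the common Hub names and may differ from the exact copies used for the leaderboard:

```python
# Sketch: a few illustrative loads of the listed datasets from the HuggingFace Hub.
from datasets import load_dataset

mnli = load_dataset("glue", "mnli")                # entailment
sst2 = load_dataset("glue", "sst2")                # sentiment
imdb = load_dataset("imdb")                        # sentiment
ag_news = load_dataset("ag_news")                  # topic classification
tweet_irony = load_dataset("tweet_eval", "irony")  # Twitter
boolq = load_dataset("boolq")                      # others
```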

I have another question.

Please contact us.