This page reports the baseline scores obtained by fine-tuning the pretrained `roberta-base` model on each of the 36 tasks. Each score is the mean and standard deviation over 20 runs with different random initializations.
task | mean | std
---|---|---
avg | 76.22 | 0.36
20_newsgroup | 85.28 | 0.31
ag_news | 89.77 | 0.33
amazon_reviews_multi | 66.58 | 0.29
anli | 50.35 | 4.06
boolq | 78.69 | 0.75
cb | 67.77 | 4.24
cola | 83.53 | 0.55
copa | 48.70 | 5.92
dbpedia | 77.30 | 0.56
esnli | 90.99 | 0.18
financial_phrasebank | 85.11 | 2.47
imdb | 93.90 | 0.12
isear | 72.47 | 0.58
mnli | 86.98 | 0.26
mrpc | 87.87 | 0.98
multirc | 61.22 | 1.58
poem_sentiment | 83.94 | 3.65
qnli | 92.41 | 0.28
qqp | 90.71 | 0.46
rotten_tomatoes | 88.42 | 0.64
rte | 72.40 | 2.09
sst2 | 94.12 | 0.38
sst_5bins | 56.68 | 0.92
stsb | 89.92 | 0.18
trec_coarse | 97.11 | 0.48
trec_fine | 87.76 | 0.74
tweet_ev_emoji | 46.30 | 0.84
tweet_ev_emotion | 81.82 | 0.45
tweet_ev_hate | 52.89 | 1.78
tweet_ev_irony | 71.56 | 1.70
tweet_ev_offensive | 84.55 | 0.71
tweet_ev_sentiment | 71.03 | 0.49
wic | 65.48 | 3.92
wnli | 54.79 | 3.82
wsc | 63.27 | 0.86
yahoo_answers | 72.40 | 0.39
Download the full table of per-run scores (all 20 repetitions): csv
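The per-task mean and std above can be recomputed from the per-run CSV with a short script. This is a minimal sketch: the column names `task`, `seed`, and `score` are assumptions about the CSV layout, not the actual schema of the downloadable file, and the sample values are made up for illustration.

```python
import csv
import io
import statistics

def aggregate(csv_text):
    """Group scores by task and return (mean, sample std) per task,
    rounded to two decimals as in the table above."""
    scores = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed columns: task, seed, score (one row per run).
        scores.setdefault(row["task"], []).append(float(row["score"]))
    return {
        task: (round(statistics.mean(vals), 2), round(statistics.stdev(vals), 2))
        for task, vals in scores.items()
    }

# Hypothetical excerpt with three runs of one task.
example = """task,seed,score
rte,0,72.5
rte,1,70.1
rte,2,74.6
"""
print(aggregate(example))  # → {'rte': (72.4, 2.25)}
```

`statistics.stdev` computes the sample standard deviation (n − 1 denominator); swap in `statistics.pstdev` if the published table used the population formula.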