This page reports baseline scores for fine-tuning the pretrained bert-base-cased model on 36 tasks. Each score is aggregated over 20 runs with different random initializations and reported as the mean and standard deviation (std) across those runs.
| | avg | 20_newsgroup | ag_news | amazon_reviews_multi | anli | boolq | cb | cola | copa | dbpedia | esnli | financial_phrasebank | imdb | isear | mnli | mrpc | multirc | poem_sentiment | qnli | qqp | rotten_tomatoes | rte | sst2 | sst_5bins | stsb | trec_coarse | trec_fine | tweet_ev_emoji | tweet_ev_emotion | tweet_ev_hate | tweet_ev_irony | tweet_ev_offensive | tweet_ev_sentiment | wic | wnli | wsc | yahoo_answers |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | 72.43 | 81.74 | 89.06 | 65.71 | 46.57 | 68.27 | 63.48 | 81.85 | 52.15 | 78.77 | 89.64 | 68.36 | 91.15 | 68.39 | 83.39 | 82.93 | 60.47 | 67.69 | 90.00 | 89.95 | 84.55 | 62.64 | 91.49 | 51.41 | 84.52 | 96.63 | 72.98 | 44.24 | 78.84 | 52.78 | 65.20 | 84.25 | 68.23 | 64.78 | 52.32 | 61.92 | 71.03 |
std | 0.48 | 1.83 | 0.30 | 0.30 | 0.60 | 1.40 | 6.64 | 0.81 | 4.36 | 0.45 | 0.66 | 14.40 | 0.18 | 1.71 | 0.24 | 2.32 | 1.18 | 2.58 | 0.63 | 0.33 | 0.40 | 2.17 | 0.42 | 0.69 | 0.61 | 0.69 | 1.72 | 0.76 | 1.07 | 1.45 | 2.00 | 0.68 | 0.67 | 2.04 | 6.02 | 5.62 | 0.33 |
Download full repetitions table: csv
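As a rough illustration of how mean and std rows like the ones above can be derived from per-run scores, the sketch below aggregates a small made-up table. Everything in it is an assumption for illustration: the CSV layout (one row per run, one column per task, a `seed` column), the helper name `aggregate_runs`, the sample numbers, and the use of the sample standard deviation. The actual repetitions CSV may be organized differently, and this page does not say whether its std is the sample or the population variant.

```python
import csv
import io
import statistics

# Hypothetical layout: one row per run (seed), one column per task.
# These scores are invented for illustration; they are NOT the real results.
SAMPLE_CSV = """seed,cola,rte,sst2
0,80.0,60.0,91.0
1,82.0,64.0,92.0
2,84.0,62.0,90.0
"""


def aggregate_runs(csv_text):
    """Return {column: (mean, std)} across runs, with an 'avg' column first.

    'avg' is computed per run as the unweighted mean over all tasks, and is
    then aggregated across runs like any other column (an assumption about
    how such a column could be built).
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    tasks = [name for name in rows[0] if name != "seed"]

    # Per-run average over tasks, followed by the per-task score lists.
    columns = {
        "avg": [statistics.mean(float(row[t]) for t in tasks) for row in rows]
    }
    for task in tasks:
        columns[task] = [float(row[task]) for row in rows]

    # statistics.stdev is the sample standard deviation (n - 1 denominator);
    # swap in statistics.pstdev for the population variant.
    return {
        name: (statistics.mean(values), statistics.stdev(values))
        for name, values in columns.items()
    }


summary = aggregate_runs(SAMPLE_CSV)
for name, (mean, std) in summary.items():
    print(f"{name}: mean={mean:.2f} std={std:.2f}")
```

With the real data, the same function could be applied to the text of the downloaded file, provided the column names are adjusted to match its actual header.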