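This page reports baseline scores for fine-tuning the pretrained bert-base-uncased model on 36 tasks, aggregated over 20 runs with different random initializations. The mean and std rows give the mean and standard deviation of each score across those runs; the avg column is the average over the 36 tasks.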
| avg | 20_newsgroup | ag_news | amazon_reviews_multi | anli | boolq | cb | cola | copa | dbpedia | esnli | financial_phrasebank | imdb | isear | mnli | mrpc | multirc | poem_sentiment | qnli | qqp | rotten_tomatoes | rte | sst2 | sst_5bins | stsb | trec_coarse | trec_fine | tweet_ev_emoji | tweet_ev_emotion | tweet_ev_hate | tweet_ev_irony | tweet_ev_offensive | tweet_ev_sentiment | wic | wnli | wsc | yahoo_answers |
mean | 72.20 | 83.05 | 89.59 | 65.92 | 46.95 | 68.96 | 64.38 | 81.83 | 49.45 | 78.16 | 89.70 | 68.53 | 91.58 | 69.07 | 83.73 | 81.99 | 59.97 | 66.68 | 89.88 | 90.27 | 84.85 | 59.98 | 91.97 | 52.80 | 85.86 | 96.06 | 68.33 | 36.01 | 79.91 | 52.85 | 67.76 | 85.37 | 69.48 | 63.25 | 50.56 | 62.12 | 72.32 |
std | 0.55 | 0.43 | 0.29 | 0.31 | 0.52 | 1.20 | 10.01 | 0.49 | 5.36 | 0.67 | 1.33 | 10.67 | 0.14 | 0.46 | 0.23 | 1.61 | 1.40 | 0.90 | 0.75 | 0.54 | 0.47 | 2.04 | 0.42 | 0.56 | 0.45 | 0.63 | 2.83 | 0.60 | 0.65 | 1.20 | 1.41 | 0.63 | 0.73 | 1.63 | 6.41 | 4.55 | 0.24 |
Download full repetitions table: csv
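As a rough illustration of how per-run scores can be reduced to mean and std rows like the ones above, here is a minimal Python sketch. The CSV layout (a header row of task names, then one row of scores per run), the function name `aggregate_runs`, and the example numbers are all assumptions made for illustration; the actual repetitions file may be organized differently, and the page does not say whether its std row uses the sample or the population formula.

```python
import csv
import io
import statistics


def aggregate_runs(csv_text):
    """Return ({task: (mean, std)}, avg) from a table with one row per run.

    Assumed (hypothetical) layout: a header row of task names, followed by
    one row of scores per random initialization.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for task in rows[0].keys():
        scores = [float(row[task]) for row in rows]
        # Sample standard deviation (n - 1 denominator); this choice is an
        # assumption, not something the page specifies.
        summary[task] = (statistics.mean(scores), statistics.stdev(scores))
    # Mirrors the "avg" column of the mean row: the mean of per-task means.
    avg = statistics.mean(mean for mean, _ in summary.values())
    return summary, avg


# Made-up scores for three runs of two tasks, for illustration only.
example = "rte,cola\n60.0,82.0\n58.0,81.0\n62.0,83.0\n"
summary, avg = aggregate_runs(example)
for task, (mean, std) in summary.items():
    print(f"{task}: mean={mean:.2f} std={std:.2f}")
print(f"avg: {avg:.2f}")
```

With only 20 runs per task, the sample and population formulas give slightly different std values, so it is worth checking which one matches the published row before comparing against it.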