# Evaluations

## Metrics
For each test instance, you are expected to return a set of 100 choices (candidate ids) from the set of possible follow-up sentences, together with a probability distribution over those 100 choices. As competition metrics we will compute a range of scores, including Recall@k, MRR (mean reciprocal rank), and MAP (mean average precision).
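As an illustration of how these metrics behave, the sketch below computes Recall@k and the reciprocal rank for a single test instance from a ranking sorted by the submitted probabilities; corpus-level Recall@k and MRR are the averages of these per-instance values. The function and variable names are illustrative, not the official evaluation script.

```python
# Minimal sketch (not the official scorer): per-instance Recall@k and
# reciprocal rank from a submitted ranking. Names are illustrative.

def recall_at_k(ranked_candidate_ids, correct_ids, k):
    """Fraction of correct follow-ups that appear in the top-k candidates."""
    top_k = set(ranked_candidate_ids[:k])
    hits = sum(1 for cid in correct_ids if cid in top_k)
    return hits / len(correct_ids)

def reciprocal_rank(ranked_candidate_ids, correct_ids):
    """1 / rank of the first correct candidate, or 0 if none is returned."""
    for rank, cid in enumerate(ranked_candidate_ids, start=1):
        if cid in correct_ids:
            return 1.0 / rank
    return 0.0

# Example: candidates sorted by the submitted probabilities, highest first.
ranking = ["c17", "c03", "c42", "c08"]   # first 4 of the 100 returned ids
gold = {"c42"}                           # correct follow-up for this instance
print(recall_at_k(ranking, gold, 10))    # 1.0 (correct id is within the top 10)
print(reciprocal_rank(ranking, gold))    # 0.333... (first correct id at rank 3)
```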
The following evaluation metrics will be used to evaluate your submissions.
| Sub-Task | Ubuntu | Advising |
| --- | --- | --- |
| 1 | Recall@1, Recall@10, Recall@50, MRR | Recall@1, Recall@10, Recall@50, MRR |
| 2 | Recall@1, Recall@10, Recall@50, MRR | NA |
| 3 | NA | Recall@1, Recall@10, Recall@50, MRR, MAP |
| 4 | Recall@1, Recall@10, Recall@50, MRR | Recall@1, Recall@10, Recall@50, MRR |
| 5 | Recall@1, Recall@10, Recall@50, MRR | Recall@1, Recall@10, Recall@50, MRR |
Note: We will evaluate MAP for sub-task 3 with the Advising data, as you are expected to return the correct response and all the paraphrases associated with it.
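Because sub-task 3 treats the correct response and all of its paraphrases as relevant, MAP is the mean over test instances of each instance's average precision. The sketch below illustrates that per-instance average precision under this assumption; it is a minimal illustration, not the official evaluation script.

```python
# Minimal sketch (assumption, not the official scorer): average precision for
# one sub-task 3 instance, where the relevant set is the correct response plus
# all of its paraphrases. MAP is the mean of this value over all instances.

def average_precision(ranked_candidate_ids, relevant_ids):
    """Mean of precision@rank taken at each rank where a relevant id appears."""
    hits, precisions = 0, []
    for rank, cid in enumerate(ranked_candidate_ids, start=1):
        if cid in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

# Example: two of the three relevant responses are retrieved, at ranks 1 and 4.
print(average_precision(["p1", "c9", "c2", "p2"], {"p1", "p2", "p3"}))
# (1/1 + 2/4) / 3 = 0.5
```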
## Best Scores

The ranking considers the average of Recall@10 and MRR (a short sketch of this combination follows the tables). The best Recall@10 and MRR scores for each sub-task are shown in the tables below.
**Recall@10**

| Sub-Task | Ubuntu | Advising-Case-1 | Advising-Case-2 |
| --- | --- | --- | --- |
| 1 | 0.902 | 0.85 | 0.63 |
| 2 | 0.361 | NA | NA |
| 3 | NA | 0.906 | 0.75 |
| 4 | 0.739 | 0.652 | 0.508 |
| 5 | 0.905 | 0.864 | 0.63 |
**MRR**

| Sub-Task | Ubuntu | Advising-Case-1 | Advising-Case-2 |
| --- | --- | --- | --- |
| 1 | 0.7350 | 0.6078 | 0.3390 |
| 2 | 0.2528 | NA | NA |
| 3 | NA | 0.6238 | 0.4341 |
| 4 | 0.5891 | 0.3495 | 0.2422 |
| 5 | 0.7399 | 0.6455 | 0.3390 |
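As noted above, the ranking criterion is simply the average of a system's Recall@10 and MRR on a given dataset. The sketch below shows that combination with hypothetical values; the numbers are not taken from the tables above.

```python
# Minimal sketch of the ranking criterion: average of a single system's
# Recall@10 and MRR. The values below are hypothetical examples.

def ranking_score(recall_at_10: float, mrr: float) -> float:
    return (recall_at_10 + mrr) / 2.0

print(ranking_score(0.90, 0.74))  # 0.82
```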