
MTRAGEval

🎉 Welcome to MTRAGEval! MTRAGEval is a task for Evaluating Multi-Turn RAG Conversations at SemEval 2026. 🎉

📝 Registration

If you plan to participate in MTRAGEval, please register for our task by filling out this form: Registration Form

Join Our Mailing List!

MTRAGEval Mailing List

Training and Trial Data

The MTRAG Benchmark is released as the trial, training, and validation data for MTRAGEval. You can access the full dataset here. We will release new evaluation data during the evaluation phase.

Note: The MTRAG Benchmark includes metadata describing several dimensions of each task, including the question type (e.g., factoid), answerability (e.g., answerable, unanswerable), and multi-turn type (e.g., follow-up, clarification). This information will NOT be provided during evaluation. We will only provide the corpus domain (e.g., ClapNQ, Govt).

📋 Tasks

Task A: Retrieval Only

Input: You are given a set of tasks, where each task contains (a) a conversation comprising a sequence of user question/agent response turns, ending with a user question, and (b) the corresponding document corpus.

Output: For each task, you are asked to return an ordered list of 10 passages from the document corpus that are relevant to the last user question (with more relevant passages appearing earlier in the list). Note that your submission for this task will only be evaluated on the subset of answerable questions; however, to avoid information leakage for the other tasks, you will not be told in advance which questions are answerable.

Note: For Task A, we will report results at cutoffs of 1, 3, 5, and 10, so make sure to return 10 passages.
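For concreteness, here is a minimal sketch of what a single Task A prediction record could look like. The field names (task_id, passage_id, score) are hypothetical illustrations; the authoritative input/output format is the one defined in the evaluation README on the GitHub repo.

```python
# Sketch of a single Task A prediction record (hypothetical field names;
# see the evaluation README for the authoritative format).
import json

# Passages ordered by decreasing relevance to the last user question.
ranked_passages = [
    {"passage_id": "clapnq_000123", "score": 0.91},
    {"passage_id": "clapnq_004567", "score": 0.85},
    # ... up to 10 passages in total
]

record = {
    "task_id": "example-task-001",     # hypothetical conversation/task identifier
    "contexts": ranked_passages[:10],  # at most 10, most relevant first
}
print(json.dumps(record, indent=2))
```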

Task B: Generation with Reference Passages (Reference)

Input: You are given a set of tasks, where each task contains (a) a conversation comprising a sequence of user question/agent response turns, ending with a user question, and (b) a set of relevant passages for the last user question.

Output: For each task, you are asked to generate an agent response for the last user question (which should be faithful w.r.t. the relevant passages).
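To illustrate one common way of doing grounded generation, here is a minimal sketch that flattens the conversation and the reference passages into a single prompt for an instruction-tuned model. The prompt template, conversation structure, and field names are assumptions for illustration only, not part of the task definition.

```python
# Minimal sketch of building a grounded-generation prompt for Task B.
# Conversation structure and field names are hypothetical illustrations.

def build_prompt(turns, passages):
    """Flatten conversation turns and reference passages into one prompt."""
    history = "\n".join(f"{t['role']}: {t['text']}" for t in turns)
    docs = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the last user question using only the passages below. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{docs}\n\nConversation:\n{history}\nagent:"
    )

turns = [
    {"role": "user", "text": "What is ClapNQ?"},
    {"role": "agent", "text": "ClapNQ is a long-form question answering dataset ..."},
    {"role": "user", "text": "Which corpus does it use?"},
]
passages = ["ClapNQ is built from Natural Questions ...", "..."]
print(build_prompt(turns, passages))  # pass this prompt to your generator
```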

Task C: Generation with Retrieved Passages (RAG)

Input: You are given a set of tasks, where each task contains (a) a conversation comprising a sequence of user question/agent response turns, ending with a user question, and (b) the corresponding document corpus.

Output: For each task, you are asked to first retrieve up to 10 passages from the document corpus that are relevant to the last user question, and then use them to generate an agent response for the last user question (which should be faithful w.r.t. the retrieved passages).

Note: Your submission for Task C will be evaluated mainly on the generated agent response; the intermediate list of retrieved passages is used only as part of the faithfulness evaluation. You may return a maximum of 10 passages, but you do not need to use the full amount (in our experiments in the MTRAG paper we used 5). Returning more passages may reduce the faithfulness score, so we recommend returning at most 5, as in our paper.
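The sketch below shows one way a Task C system could cap the passages it grounds on, following the recommendation above. The retrieve/generate stubs, field names, and example data are placeholders for your own components, not part of the task definition.

```python
# Minimal Task C sketch: retrieve, keep at most 5 passages (as recommended
# above), then generate a grounded response. The stubs below are placeholders.

MAX_PASSAGES = 5  # up to 10 may be returned, but we recommend at most 5

def retrieve(question, corpus, k=10):
    """Placeholder retriever: rank passages by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(corpus, key=lambda p: -len(q_words & set(p["text"].lower().split())))
    return scored[:k]

def generate(conversation, passages):
    """Placeholder generator: your LLM call goes here."""
    return "... generated agent response grounded in the selected passages ..."

def answer_last_turn(conversation, corpus):
    question = conversation[-1]["text"]                # last user question
    used = retrieve(question, corpus)[:MAX_PASSAGES]   # ground on at most 5
    return {
        "contexts": [p["passage_id"] for p in used],   # passages reported in the submission
        "response": generate(conversation, used),      # the evaluated response
    }

# Hypothetical example data, for illustration only.
corpus = [{"passage_id": "govt_000001", "text": "Passport renewal requires form DS-82 ..."}]
conversation = [{"role": "user", "text": "How do I renew my passport?"}]
print(answer_last_turn(conversation, corpus))
```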

Read more about our tasks in our proposal

Data Format

The evaluation data follows the MTRAG data format. The input and output formats for each task, sample data, and Format Checker scripts are available on the GitHub repo! Please visit the evaluation README for more information.

Evaluation

Evaluation Scripts

Evaluation and Format Checker scripts are available on the GitHub repo! Please visit the evaluation README for more information.

🏆 Leaderboard Ranking

We will use the Evaluation Scripts provided above to evaluate each team’s system.

The ranking for the retrieval task (Task A) will be based on nDCG.
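For reference, here is a minimal sketch of nDCG@k with binary relevance labels. The official scores come from the evaluation scripts above, so treat this only as an illustration of the metric.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG normalized by the ideal (sorted) DCG."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Example: binary relevance labels of the 10 returned passages, in rank order.
print(ndcg_at_k([1, 0, 1, 0, 0, 0, 1, 0, 0, 0], k=10))
```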

The ranking for the generation tasks (B and C) will be computed as the harmonic mean of RL_F, RB_llm, and RB_alg (a short sketch of this aggregation follows the table below). To illustrate, we present the ranking of the results reported in the MTRAG paper. Please note that rank is not the only indication of a strong system; in particular, differences in rank may be large even when several systems achieve very close scores, as in the results below.

| Rank | Task B (Reference) | Harmonic Mean | Rank | Task C (RAG) | Harmonic Mean |
|------|--------------------|---------------|------|--------------|---------------|
| 1st | Reference | 0.89 | 1st | Reference | 0.81 |
| 2nd | GPT-4o | 0.60 | 2nd | GPT-4o | 0.53 |
| 2nd | Llama-3.1-405B-Instruct | 0.60 | 2nd | Llama-3.1-405B-Instruct | 0.53 |
| 4th | GPT-4o-mini | 0.57 | 4th | Qwen-2.5-(72B) | 0.52 |
| 4th | Qwen-2.5-(72B) | 0.57 | 4th | Llama-3.1-70B-Instruct | 0.52 |
| 4th | Command-R+(104B) | 0.57 | 6th | GPT-4o-mini | 0.51 |
| 7th | Qwen-2.5-(7B) | 0.55 | 6th | Command-R+(104B) | 0.51 |
| 8th | Llama-3.1-70B-Instruct | 0.54 | 6th | Qwen-2.5-(7B) | 0.51 |
| 9th | Mixtral-8x22B-Instruct | 0.51 | 9th | Mixtral-8x22B-Instruct | 0.48 |
| 10th | Llama-3.1-8B-Instruct | 0.45 | 10th | Llama-3.1-8B-Instruct | 0.45 |
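Below is a minimal sketch of the harmonic-mean aggregation used to rank Tasks B and C, assuming the per-metric scores (RL_F, RB_llm, RB_alg) have already been computed by the evaluation scripts; the numbers are hypothetical.

```python
def harmonic_mean(scores):
    """Harmonic mean of the given scores; defined as 0 if any score is 0."""
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

# Hypothetical per-metric scores produced by the evaluation scripts.
rl_f, rb_llm, rb_alg = 0.62, 0.58, 0.60
print(round(harmonic_mean([rl_f, rb_llm, rb_alg]), 2))  # ranking score for the system
```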

Task Submission and Evaluation Data

Task submission will only be open during the evaluation phase. All submissions will be via a Google Form that will be provided per task.

Because LLM-judge evaluation is resource-intensive, we are restricting submission to one run per task per team (if you submit more than one, we will only evaluate the last one). We will release the predictions after the evaluation phase so that you can try out and report other techniques in your paper.

The evaluation data will be provided to all registered participants at the start of each evaluation phase.

Note: As noted above, the MTRAG metadata (question type, answerability, and multi-turn type) will NOT be provided during evaluation; only the corpus domain (e.g., ClapNQ, Govt) will be given.

📆 Timeline (Tentative)

Task Organizers

Sara Rosenthal ✉️
Yannis Katsis ✉️
Vraj Shah ✉️
Marina Danilevsky ✉️