Welcome to MTRAGEval! MTRAGEval is a task for Evaluating Multi-Turn RAG Conversations at SemEval 2026.
Please fill out this form to register for our task if you plan to participate in MTRAGEval: Registration Form
The MTRAG Benchmark is released as the trial, training, and validation data for MTRAGEval. You can access the full dataset here. We will release new evaluation data during the evaluation phase.
Note: The MTRAG Benchmark includes metadata describing dimensions of each task, including the question type (e.g., factoid), answerability (e.g., answerable, unanswerable), and multi-turn type (e.g., follow-up, clarification). This information will NOT be provided during evaluation. We will only provide the corpus domain (e.g., ClapNQ, Govt).
Task A: Retrieval
Input: You are given a set of tasks, where each task contains (a) a conversation consisting of user question/agent response turns ending with a user question and (b) the corresponding document corpus.
Output: For each task, you are asked to return an ordered list of 10 passages from the document corpus that are relevant to the last user question (with more relevant passages appearing earlier in the list). Note that your submission for this task will only be evaluated on the subset of answerable questions; however, to avoid information leakage for the other tasks, you will not be told in advance which questions are answerable.
Note: For Task A, we will report results at cutoffs of 1, 3, 5, and 10, so make sure to return 10 passages.
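To make the retrieval setting concrete, here is a minimal sketch of ranking corpus passages against the last user question with a simple lexical (TF-IDF) retriever. The conversation/corpus structures and the `retrieve_top_k` helper are illustrative assumptions only; the official input and output formats are defined on the GitHub repo.

```python
# Illustrative sketch of Task A: rank corpus passages against the last user question.
# The data structures here are assumptions; see the GitHub repo for the official format.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_top_k(conversation, corpus, k=10):
    """Return the ids of the k passages most similar to the last user question."""
    last_user_question = conversation[-1]["text"]  # conversations end with a user question
    passage_ids = list(corpus.keys())
    passage_texts = [corpus[pid] for pid in passage_ids]

    vectorizer = TfidfVectorizer()
    passage_vecs = vectorizer.fit_transform(passage_texts)
    query_vec = vectorizer.transform([last_user_question])

    scores = cosine_similarity(query_vec, passage_vecs)[0]
    ranked = sorted(zip(passage_ids, scores), key=lambda x: x[1], reverse=True)
    return [pid for pid, _ in ranked[:k]]  # most relevant passage first

# Toy usage (hypothetical data, not MTRAG content):
conversation = [
    {"speaker": "user", "text": "Who founded the company?"},
    {"speaker": "agent", "text": "It was founded by Jane Doe."},
    {"speaker": "user", "text": "When did she found it?"},
]
corpus = {"p1": "Jane Doe founded the company in 1998.", "p2": "The company sells widgets."}
print(retrieve_top_k(conversation, corpus, k=10))
```

A stronger system would of course condition on the full conversation (e.g., via query rewriting or a dense retriever) rather than the last turn alone; the sketch only shows the expected shape of the output: an ordered list of passage ids.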
Task B: Generation
Input: You are given a set of tasks, where each task contains (a) a conversation consisting of user question/agent response turns ending with a user question and (b) a set of relevant passages for the last user question.
Output: For each task, you are asked to generate an agent response to the last user question that is faithful with respect to the relevant passages.
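As a rough illustration of the generation setting, the sketch below assembles the conversation and the provided reference passages into a single grounded prompt; `generate` stands in for whatever LLM call you use (API or local model) and is an assumption, not part of the task definition.

```python
# Illustrative sketch of Task B: build a grounded prompt from the conversation
# and the reference passages, then hand it to any generation backend.
def build_task_b_prompt(conversation, passages):
    """Assemble a prompt that asks for an answer grounded only in the passages."""
    history = "\n".join(f"{turn['speaker']}: {turn['text']}" for turn in conversation)
    docs = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the last user question using ONLY the passages below. "
        "If the passages do not contain the answer, say so.\n\n"
        f"Passages:\n{docs}\n\nConversation:\n{history}\nagent:"
    )

def answer_task_b(conversation, passages, generate):
    """`generate` is any callable that maps a prompt string to a response string."""
    return generate(build_task_b_prompt(conversation, passages))
```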
Task C: Retrieval and Generation (RAG)
Input: You are given a set of tasks, where each task contains (a) a conversation consisting of user question/agent response turns ending with a user question and (b) the corresponding document corpus.
Output: For each task, you are asked to first retrieve up to 10 passages from the document corpus that are relevant to the last user question and then use them to generate an agent response to that question that is faithful with respect to the retrieved passages.
Note: Your submission for Task C will be evaluated mainly on the generated agent response; the intermediate list of retrieved passages is used in the faithfulness evaluation. You may return up to 10 passages, but you do not need to use the full amount, and returning more passages may reduce the faithfulness score. We recommend returning at most 5, as in the experiments of the MTRAG paper.
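Putting the two sketches above together gives a bare-bones Task C pipeline: retrieve a small set of passages for the last user question, then generate a response grounded in them. The 5-passage default mirrors the recommendation above; `retrieve_top_k` and `answer_task_b` are the illustrative helpers from the Task A and Task B sketches.

```python
# Illustrative sketch of Task C: retrieval followed by grounded generation,
# reusing the helpers sketched for Tasks A and B above.
def answer_task_c(conversation, corpus, generate, k=5):
    """Retrieve up to k passages (at most 5 recommended), then generate a grounded response."""
    passage_ids = retrieve_top_k(conversation, corpus, k=k)
    passages = [corpus[pid] for pid in passage_ids]
    response = answer_task_b(conversation, passages, generate)
    return {"retrieved_passages": passage_ids, "response": response}
```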
Read more about our tasks in our proposal
The evaluation data follows the MTRAG data format. The input and output formats for each task, sample data, and Format Checker scripts are available on the GitHub repo! Please visit the evaluation README for more information.
Evaluation and Format Checker scripts are available on the GitHub repo! Please visit the evaluation README for more information.
We will use the Evaluation Scripts provided above to evaluate each team's system.
The ranking for the retrieval task (Task A) will be based on nDCG.
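For reference, nDCG at the cutoffs above can be computed with off-the-shelf tooling; the snippet below uses scikit-learn's `ndcg_score` on a toy example with binary relevance labels. The relevance judgments and official scores come from the evaluation scripts linked above, so this is only an illustration of the metric.

```python
# Toy illustration of nDCG@k with binary relevance, using scikit-learn.
import numpy as np
from sklearn.metrics import ndcg_score

# true_relevance[i]: gold relevance of candidate passage i (1 = relevant).
# system_scores[i]: the system's retrieval score for candidate passage i.
true_relevance = np.asarray([[1, 0, 0, 1, 0, 0, 0, 0, 0, 0]])
system_scores = np.asarray([[0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]])

for k in (1, 3, 5, 10):
    print(f"nDCG@{k} = {ndcg_score(true_relevance, system_scores, k=k):.3f}")
```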
The ranking for the generation tasks (B and C) will be computed as the harmonic mean of RL_F, RB_llm, and RB_alg. To illustrate, we present a ranking of the results reported in the MTRAG paper below (a small sketch of the harmonic-mean computation follows the table). Please note that ranking is not the only indication of a strong system; in particular, the difference in rank may be large when several systems achieve close scores, as in the results below.
| Rank | Task B (Reference) | Harmonic Mean | Rank | Task C (RAG) | Harmonic Mean |
|---|---|---|---|---|---|
| 1st | Reference | 0.89 | 1st | Reference | 0.81 |
| 2nd | GPT-4o | 0.60 | 2nd | GPT-4o | 0.53 |
| 2nd | Llama-3.1-405B-Instruct | 0.60 | 2nd | Llama-3.1-405B-Instruct | 0.53 |
| 4th | GPT-4o-mini | 0.57 | 4th | Qwen-2.5-(72B) | 0.52 |
| 4th | Qwen-2.5-(72B) | 0.57 | 4th | Llama-3.1-70B-Instruct | 0.52 |
| 4th | Command-R+(104B) | 0.57 | 6th | GPT-4o-mini | 0.51 |
| 7th | Qwen-2.5-(7B) | 0.55 | 6th | Command-R+(104B) | 0.51 |
| 8th | Llama-3.1-70B-Instruct | 0.54 | 6th | Qwen-2.5-(7B) | 0.51 |
| 9th | Mixtral-8x22B-Instruct | 0.51 | 9th | Mixtral-8x22B-Instruct | 0.48 |
| 10th | Llama-3.1-8B-Instruct | 0.45 | 10th | Llama-3.1-8B-Instruct | 0.45 |
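To make the ranking criterion concrete, the harmonic mean of the three generation metrics can be computed as below; the per-metric values in the snippet are placeholders for illustration, not scores from the table.

```python
# Harmonic mean of the three generation metrics used to rank Tasks B and C.
from statistics import harmonic_mean

# Placeholder per-metric scores for one hypothetical system.
rl_f, rb_llm, rb_alg = 0.62, 0.58, 0.60
ranking_score = harmonic_mean([rl_f, rb_llm, rb_alg])
print(f"Harmonic mean = {ranking_score:.3f}")
```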
Task submission will only be open during the evaluation phase. All submissions will be via a Google Form that will be provided per task.
Because LLM-judge evaluation is resource-intensive, we're restricting submissions to one run per task per team (if you submit more than one, we will only evaluate the last one). We will release the predictions after the evaluation phase so that you can try out and report other techniques in your paper.
The evaluation data will be provided to all registered participants at the start of each evaluation phase.
Sara Rosenthal
Yannis Katsis
Vraj Shah
Marina Danilevsky