Welcome to MTRAGEval! MTRAGEval is a task for Evaluating Multi-Turn RAG Conversations at SemEval 2026.
If you plan to participate in MTRAGEval, please register for our task by filling out the Registration Form.
The MTRAG Benchmark is released as the trial and training data for MTRAGEval. You can access the full dataset here.
Note: The MTRAG Benchmark includes metadata that describes dimensions of each task, including the question type (e.g., factoid), answerability (e.g., answerable, unanswerable), and multi-turn type (e.g., follow-up, clarification). This information will NOT be provided during evaluation. We will only provide the corpus domain (e.g., ClapNQ, Govt).
Read more about our tasks in our proposal.
Retrieval and Generation Evaluation Scripts are available on the GitHub repo! Please visit the evaluation README for more information.
Coming Soon: Validation Script
We will use the Evaluation Scripts provided above to evaluate each team's system.
The ranking for the retrieval task (Task A) will be based on nDCG.
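For intuition only, here is a minimal pure-Python nDCG sketch; the official Evaluation Scripts in the GitHub repo are authoritative, and the relevance labels below are made up for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(system_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = sorted(system_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(system_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels of retrieved passages, in the order the system ranked them
# (1 = relevant, 0 = not relevant); values are illustrative only.
print(round(ndcg([1, 0, 1, 0, 0], k=5), 3))  # -> 0.92
```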
The ranking for the generation tasks (Tasks B and C) will be computed as the harmonic mean of RL_F, RB_llm, and RB_alg. To illustrate, we present the ranking of the results reported in the MTRAG paper below; a minimal sketch of the ranking computation follows the table. Please note that rank is not the only indication of a strong system: in particular, the difference in rank can be large even when several systems achieve very close scores, as in the results below.
| Rank | Task B (Reference) | Harmonic Mean | Rank | Task C (RAG) | Harmonic Mean |
|---|---|---|---|---|---|
| 1st | Reference | 0.89 | 1st | Reference | 0.81 |
| 2nd | GPT-4o | 0.60 | 2nd | GPT-4o | 0.53 |
| 2nd | Llama-3.1-405B-Instruct | 0.60 | 2nd | Llama-3.1-405B-Instruct | 0.53 |
| 4th | GPT-4o-mini | 0.57 | 4th | Qwen-2.5-(72B) | 0.52 |
| 4th | Qwen-2.5-(72B) | 0.57 | 4th | Llama-3.1-70B-Instruct | 0.52 |
| 4th | Command-R+(104B) | 0.57 | 6th | GPT-4o-mini | 0.51 |
| 7th | Qwen-2.5-(7B) | 0.55 | 6th | Command-R+(104B) | 0.51 |
| 8th | Llama-3.1-70B-Instruct | 0.54 | 6th | Qwen-2.5-(7B) | 0.51 |
| 9th | Mixtral-8x22B-Instruct | 0.51 | 9th | Mixtral-8x22B-Instruct | 0.48 |
| 10th | Llama-3.1-8B-Instruct | 0.45 | 10th | Llama-3.1-8B-Instruct | 0.45 |
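As a rough sketch of how the generation ranking is computed, the snippet below ranks systems by the harmonic mean of the three metrics. The system names and scores are hypothetical, not values from the table above.

```python
from statistics import harmonic_mean

# Hypothetical per-system scores for the three generation metrics
# (RL_F, RB_llm, RB_alg); names and values are illustrative only.
scores = {
    "system_a": {"RL_F": 0.62, "RB_llm": 0.58, "RB_alg": 0.60},
    "system_b": {"RL_F": 0.55, "RB_llm": 0.60, "RB_alg": 0.47},
}

# Rank systems by the harmonic mean of the three metrics,
# mirroring how Tasks B and C are ranked.
ranked = sorted(scores, key=lambda s: harmonic_mean(scores[s].values()), reverse=True)

for rank, name in enumerate(ranked, start=1):
    print(f"{rank}. {name}: harmonic mean = {harmonic_mean(scores[name].values()):.2f}")
```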
Task submission will only be open during the evaluation phase. All submissions will be made via a Google Form that will be provided for each task.
The evaluation data will be provided to all registered participants at the start of each evaluation phase.
Note: As stated above, the task metadata (question type, answerability, and multi-turn type) will NOT be included in the evaluation data; only the corpus domain (e.g., ClapNQ, Govt) will be provided.
Sara Rosenthal
Yannis Katsis
Vraj Shah
Marina Danilevsky