The datasets can be downloaded from the following links.

Note: the Ubuntu data is NOT the same as the previous Ubuntu dataset from Lowe et al. (2015). It is a new resource, described in the following paper:

  author    = {Jonathan K. Kummerfeld and Sai R. Gouravajhala and Joseph Peper and Vignesh Athreya and Chulaka Gunasekara and Jatin Ganhotra and Siva Sankalp Patel and Lazaros Polymenakos and Walter S. Lasecki},
  title     = {Analyzing Assumptions in Conversation Disentanglement Research Through the Lens of a New Dataset and Model},
  journal   = {ArXiv e-prints},
  archivePrefix = {arXiv},
  eprint    = {1810.11118},
  primaryClass = {cs.CL},
  year      = {2018},
  month     = {October},
  url       = {},

Training and Validation

Additionally, for the Advising data, we are providing a form of the data with the original dialogs and their paraphrases before remixing. This can be used for training in any sub-task, and can be downloaded here. The global candidate pool for sub-task 2 should be shared across the training, validation, and test datasets.
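
As a rough illustration of what sharing one candidate pool across splits could look like in code, here is a minimal Python sketch. It uses small in-memory stand-ins rather than the real files, and the field names `candidate-id`, `utterance`, and `target-id` are hypothetical, chosen only for this example; check the downloaded files for the actual schema.

```python
from typing import Dict, List

# Hypothetical stand-in type: candidate id -> utterance text.
# The real file layout and field names are assumptions, not taken from the data.
CandidatePool = Dict[str, str]


def build_pool(entries: List[dict]) -> CandidatePool:
    """Index the global candidate pool once, keyed by candidate id."""
    return {e["candidate-id"]: e["utterance"] for e in entries}


def resolve_targets(dialogs: List[dict], pool: CandidatePool) -> List[str]:
    """Look up each dialog's correct response in the shared pool."""
    return [pool[d["target-id"]] for d in dialogs]


# Toy pool with two candidates (illustrative text only).
pool = build_pool([
    {"candidate-id": "c1", "utterance": "try sudo apt-get update"},
    {"candidate-id": "c2", "utterance": "which release are you on?"},
])

# Toy splits: every split refers to ids in the same pool.
splits = {
    "train": [{"target-id": "c1"}],
    "validation": [{"target-id": "c2"}],
    "test": [{"target-id": "c1"}],
}

# The same pool object is passed to every split; it is never rebuilt per split.
resolved = {name: resolve_targets(dialogs, pool) for name, dialogs in splits.items()}
```

Building the index once and passing it to every split keeps candidate ids consistent, so an id seen in training refers to the same utterance at validation and test time.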