Recently, Talmor and Berant (2018) introduced ComplexWebQuestions, a dataset focused on answering complex questions by decomposing them into a sequence of simpler questions and extracting the answer from retrieved web snippets. In their work, the authors used a pre-trained reading comprehension (RC) model (Salant and Berant, 2018) to extract the answer from the web snippets. In this short note we show that training an RC model directly on the training data of ComplexWebQuestions reveals a leakage from the training set to the test set that makes it possible to obtain unreasonably high performance. As a solution, we construct a new partitioning of ComplexWebQuestions that does not suffer from this leakage and publicly release it. We also perform an empirical evaluation on these two datasets and show that training an RC model on the training data substantially improves state-of-the-art performance.
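The leakage-free repartitioning described above can be sketched as a group-based split: examples that share a common source (e.g., questions generated from the same seed question) are kept in the same partition, so near-duplicates cannot straddle the train/test boundary. The sketch below is illustrative, not the authors' actual procedure; the `group_split` function and the `seed` field are assumed names.

```python
import random
from collections import defaultdict

def group_split(examples, group_key, test_frac=0.1, seed=0):
    """Split examples so that all items sharing a group key land in the
    same partition, preventing train/test leakage of near-duplicates."""
    groups = defaultdict(list)
    for ex in examples:
        groups[group_key(ex)].append(ex)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    # Hold out whole groups, never individual examples.
    n_test = max(1, int(len(keys) * test_frac))
    test_keys = set(keys[:n_test])
    train = [ex for k in keys if k not in test_keys for ex in groups[k]]
    test = [ex for k in keys if k in test_keys for ex in groups[k]]
    return train, test

# Hypothetical usage: each example records which seed question produced it.
examples = [{"qid": f"q{i}", "seed": f"s{i % 3}"} for i in range(12)]
train, test = group_split(examples, lambda ex: ex["seed"], test_frac=0.34)
```

A plain random split over examples would not give this guarantee, since two paraphrases derived from the same seed question could land on opposite sides of the split.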
[1] Yushi Wang, Jonathan Berant, and Percy Liang. Building a Semantic Parser Overnight. ACL, 2015.
[2] Shimi Salant and Jonathan Berant. Contextualized Word Representations for Reading Comprehension. NAACL, 2018.
[3] Christopher Clark and Matt Gardner. Simple and Effective Multi-Paragraph Reading Comprehension. ACL, 2018.
[4] Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. The Value of Semantic Parse Labeling for Knowledge Base Question Answering. ACL, 2016.
[5] Alon Talmor and Jonathan Berant. The Web as a Knowledge-Base for Answering Complex Questions. NAACL, 2018.
[6] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. SIGMOD, 2008.
[7] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. EMNLP, 2016.