On Paraphrase Identification Corpora

In this paper, we analyze a number of data sets proposed over the last decade or so for the task of paraphrase identification. The goal of the analysis is to identify the advantages as well as the shortcomings of these data sets. Based on the analysis, we then make recommendations on how to improve the process of creating and using such data sets for future evaluations of approaches to paraphrase identification or to the more general task of semantic similarity. The recommendations are meant to improve our understanding of what a paraphrase is, offer a fairer ground for comparing approaches, increase the diversity of actual linguistic phenomena that future data sets will cover, and offer ways to improve our understanding of the contributions of various modules or approaches proposed for solving the task of paraphrase identification or similar tasks.
