Automatically Constructing a Corpus of Sentential Paraphrases

An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly available labeled corpora of sentential paraphrases. This paper describes the creation of the recently released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The corpus was created using heuristic extraction techniques in conjunction with an SVM-based classifier to select likely sentence-level paraphrases from a large corpus of topic-clustered news data. These pairs were then submitted to human judges, who confirmed that 67% were in fact semantically equivalent. In addition to describing the corpus itself, we explore a number of issues that arose in defining guidelines for the human raters.
