A contrastive review of paraphrase acquisition techniques

This paper addresses the question of which approach to use for building a corpus of sentential paraphrases, depending on one's requirements. Six strategies are studied: (1) multiple translations into a single language from another language; (2) multiple translations into a single language from several different languages; (3) multiple descriptions of short videos; (4) multiple subtitles for the same language; (5) headlines for similar news articles; and (6) sub-sentential paraphrasing in the context of a Web-based game. We report results on French for 50 paraphrase pairs collected for all of these strategies; the corpora were manually aligned at the finest possible level to define oracle performance in terms of accessible sub-sentential paraphrases. The differences observed will be used as criteria for motivating the choice of a given approach before attempting to build a new paraphrase corpus.
