Using Discourse Information for Paraphrase Extraction

Previous work on paraphrase extraction from parallel or comparable corpora has generally not treated the documents' discourse structure as a source of information. We propose a novel method for collecting paraphrases that relies on the sequential order of events in the discourse, using multiple sequence alignment with a semantic similarity measure. We show that adding discourse information improves sentence-level paraphrase acquisition, which in turn substantially improves the extraction of phrase-level paraphrase fragments from the matched sentences. Our system beats an informed baseline by a margin of 50%.
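To make the idea of order-preserving matching concrete, here is a minimal sketch that aligns the sentence sequences of two texts describing the same events. It rests on several assumptions that are not taken from the paper: it uses pairwise Needleman-Wunsch alignment rather than multiple sequence alignment, a toy word-overlap (Jaccard) score in place of the semantic similarity measure, and hypothetical parameter names and values (`gap`, `min_sim`).

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for a semantic similarity measure: word-set overlap."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def align(seq_a: List[str], seq_b: List[str],
          gap: float = -0.1, min_sim: float = 0.3) -> List[Tuple[int, int]]:
    """Needleman-Wunsch alignment of two sentence sequences.

    Returns index pairs (i, j) of sentences matched in sequential order.
    `gap` and `min_sim` are illustrative values, not tuned settings.
    """
    n, m = len(seq_a), len(seq_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Back-pointers: "d" = match/mismatch, "u" = skip in seq_a, "l" = skip in seq_b.
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap, "u"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap, "l"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Similarity is shifted by min_sim so weak matches are penalised.
            sim = jaccard(seq_a[i - 1], seq_b[j - 1]) - min_sim
            options = [
                (score[i - 1][j - 1] + sim, "d"),
                (score[i - 1][j] + gap, "u"),
                (score[i][j - 1] + gap, "l"),
            ]
            score[i][j], back[i][j] = max(options, key=lambda o: o[0])

    # Trace back from the bottom-right cell, keeping only confident matches.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        step = back[i][j]
        if step == "d":
            if jaccard(seq_a[i - 1], seq_b[j - 1]) >= min_sim:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "u":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    doc_a = ["the man opened the door",
             "he walked into the kitchen",
             "he made some coffee"]
    doc_b = ["someone opened the door",
             "the phone rang",
             "he walked slowly into the kitchen",
             "he made coffee"]
    for i, j in align(doc_a, doc_b):
        print(f"{doc_a[i]!r} <-> {doc_b[j]!r}")
```

Because the alignment is monotone, a sentence can only be matched to a counterpart that occurs at a compatible position in the event sequence. This is the intuition behind using discourse order as a constraint: in the example, the unmatched sentence "the phone rang" is skipped with a gap instead of being forced onto a weak lexical match.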
