New Functions for Unsupervised Asymmetrical Paraphrase Detection

Monolingual text-to-text generation is an emerging research area in Natural Language Processing. One reason for the interest in such generation systems is the possibility of automatically learning text-to-text generation strategies from aligned monolingual corpora. In this context, paraphrase detection can be seen as the task of aligning sentences that convey the same information yet are written in different forms, thereby building a training set of rewriting examples. In this paper, we propose a new type of mathematical function for unsupervised detection of paraphrases, and test it over a set of standard paraphrase corpora. The results are promising: the proposed functions outperform state-of-the-art functions developed for similar tasks. We consider two types of paraphrases, symmetrical and asymmetrical entailed, and show that although our proposed functions were conceived for and oriented toward asymmetrical detection, they also perform rather well at identifying symmetrical sentence pairs.
