Dissimilarity Kernels for Paraphrase Identification

This paper presents a novel approach to paraphrase identification based on lexical dissimilarity kernels. Lexical kernels used in conjunction with Support Vector Machines are preferred over other learning methods, e.g., decision trees, because they can handle a large number of features. Dissimilarity-based kernels emphasize the differences between text fragments and are therefore well suited to text similarity tasks characterized by high lexical overlap. We evaluated our kernels on the Microsoft Research (MSR) Paraphrase Corpus, a standard data set for assessing approaches to paraphrase identification. The accuracy we report, obtained using 10-fold cross-validation over the entire corpus, is competitive and robust when compared with state-of-the-art single-model approaches. We also report competitive results on the test portion of the MSR Paraphrase Corpus, which is the standard way to report results on this corpus.
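
To make the idea of a kernel over lexical dissimilarities concrete, the following is a minimal sketch and not the kernel used in this paper, whose definition the abstract does not give. It assumes a naive whitespace tokenizer, a hypothetical two-dimensional feature vector (counts of shared and unshared token types), and a plain dot product as the kernel between two sentence pairs. All function names are illustrative.

```python
def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return sentence.lower().split()


def dissimilarity_features(sent_a, sent_b):
    """Map a sentence pair to (shared, unshared) token-type counts.

    Hypothetical feature design for illustration only.
    """
    a, b = set(tokenize(sent_a)), set(tokenize(sent_b))
    shared = len(a & b)
    unshared = len(a ^ b)  # symmetric difference: types found in exactly one sentence
    return (shared, unshared)


def pair_kernel(pair_x, pair_y):
    """Linear kernel (dot product) between the feature vectors of two sentence pairs."""
    fx = dissimilarity_features(*pair_x)
    fy = dissimilarity_features(*pair_y)
    return sum(u * v for u, v in zip(fx, fy))


# Toy sentence pairs (invented for this sketch, not taken from the MSR corpus).
paraphrase = ("the cat sat on the mat", "the cat sat on a mat")
non_paraphrase = ("the cat sat on the mat", "the dog ran in the park")

print(dissimilarity_features(*paraphrase))
print(dissimilarity_features(*non_paraphrase))
print(pair_kernel(paraphrase, non_paraphrase))
```

Even in this toy setting, the unshared count separates the two pairs far more sharply than the shared count would in a high-overlap corpus, which is the intuition behind emphasizing dissimilarities. A matrix of such kernel values over all training pairs could then be handed to any SVM implementation that accepts a precomputed kernel.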
