Boosting paraphrase detection through textual similarity metrics with abductive networks

Highlights
Analyze a set of weak text reuse similarity metrics for paraphrase detection.
Boost the performance of individual metrics using the abductive learning paradigm.
Use decision-level fusion to build a committee of models of individual metrics.
Use feature-level fusion to obtain a paraphrase detector that uses an optimal set of metrics.
Validate the merits of the approach over individual metrics and other learning methods.

A number of metrics have been proposed in the literature to measure text reuse between pairs of sentences or short passages. On their own, these metrics fail to reliably detect paraphrasing or semantic equivalence between sentences, because the task is subjective and complex even for human beings. This paper analyzes a set of five simple but weak lexical metrics for measuring textual similarity and presents a novel paraphrase detector with improved accuracy based on abductive machine learning. The objective is twofold. First, the performance of each individual metric is boosted through the abductive learning paradigm. Second, we investigate the use of decision-level and feature-level information fusion via abductive networks to obtain a more reliable composite metric for additional performance enhancement. Several experiments were conducted using two benchmark corpora, and the optimal abductive models were compared with other approaches. The results demonstrate that abductive learning significantly improved the results of the individual metrics, and that fusion yielded further improvement. Moreover, building simple models of polynomial functional elements that identify and integrate the smallest subset of relevant metrics yielded better results than support vector machine classifiers using the same datasets and metrics. The results were also comparable to the best results reported in the literature, even those obtained with a larger number of more powerful features and/or more computationally intensive techniques.
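
To make the distinction between decision-level and feature-level fusion concrete, the following is a minimal sketch rather than the paper's method. The three lexical metrics (word Jaccard, word-bigram Dice, token length ratio), the thresholds, and the polynomial coefficients are all illustrative assumptions: the abstract does not name its five metrics, and in an abductive network the polynomial terms and weights would be learned from training data, not set by hand.

```python
from typing import Callable, List, Sequence, Set, Tuple

def tokens(text: str) -> List[str]:
    """Lowercase whitespace tokenization (an assumed, deliberately simple choice)."""
    return text.lower().split()

def word_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the two word sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def bigrams(text: str) -> Set[Tuple[str, str]]:
    toks = tokens(text)
    return set(zip(toks, toks[1:]))

def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over word-bigram sets; 0.0 if neither text has a bigram."""
    ba, bb = bigrams(a), bigrams(b)
    total = len(ba) + len(bb)
    if total == 0:
        return 0.0
    return 2 * len(ba & bb) / total

def length_ratio(a: str, b: str) -> float:
    """Ratio of the shorter to the longer token count."""
    la, lb = len(tokens(a)), len(tokens(b))
    if max(la, lb) == 0:
        return 1.0
    return min(la, lb) / max(la, lb)

# Hypothetical metric set; stands in for the paper's five unnamed metrics.
METRICS: Sequence[Callable[[str, str], float]] = (word_jaccard, bigram_dice, length_ratio)

def decision_fusion(a: str, b: str,
                    thresholds: Sequence[float] = (0.5, 0.4, 0.7)) -> bool:
    """Decision-level fusion: each metric votes on its own, the majority wins."""
    votes = [m(a, b) >= t for m, t in zip(METRICS, thresholds)]
    return sum(votes) * 2 > len(votes)

def feature_fusion(a: str, b: str, threshold: float = 0.5) -> bool:
    """Feature-level fusion: metric values feed one second-order polynomial.

    The coefficients are hand-picked for illustration only.
    """
    x1, x2 = word_jaccard(a, b), bigram_dice(a, b)
    score = 0.5 * x1 + 0.3 * x2 + 0.2 * x1 * x2
    return score >= threshold

if __name__ == "__main__":
    s1 = "the cat sat on the mat"
    s2 = "a dog ran in the park"
    print(decision_fusion(s1, s1), decision_fusion(s1, s2))
    print(feature_fusion(s1, s1), feature_fusion(s1, s2))
```

In the decision-level variant each metric is thresholded separately and only the votes are combined, whereas in the feature-level variant the raw metric values enter a single model, which lets the model drop or reweight metrics that add little.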
