An Innovative Similarity Measure for Sentence Plagiarism Detection

We propose and experimentally assess Semantic Word Error Rate (SWER), an innovative similarity measure for sentence plagiarism detection. SWER introduces an approach based on latent semantic analysis that can outperform competing methods in plagiarism detection accuracy. We describe the principles and functionalities of SWER, and we complement this analytical contribution with a preliminary experimental analysis. The results are promising and confirm the effectiveness of our proposal.
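
The abstract does not give the definition of SWER, so the following is only an illustrative sketch of the general idea the name suggests: the classical word error rate (a word-level edit distance normalized by the reference length) with a substitution cost that shrinks when two words are semantically close. The function names, the `TOY_SIMILARITY` table, and the cost `1 - similarity` are assumptions made for this example. In particular, the toy table merely stands in for similarities that would come from a latent semantic analysis model, and none of this should be read as the authors' actual formulation.

```python
from typing import Callable, Optional, Sequence

# Hypothetical word-pair similarities in [0, 1]; a stand-in for values
# that an LSA model would provide. Not taken from the paper.
TOY_SIMILARITY = {frozenset({"car", "automobile"}): 0.9}


def semantic_cost(a: str, b: str) -> float:
    """Assumed substitution cost: 0 for identical words, else 1 - similarity."""
    if a == b:
        return 0.0
    return 1.0 - TOY_SIMILARITY.get(frozenset({a, b}), 0.0)


def word_error_rate(
    reference: Sequence[str],
    hypothesis: Sequence[str],
    sub_cost: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Word-level edit distance divided by the reference length.

    With the default 0/1 substitution cost this is the classical WER.
    Passing `semantic_cost` gives the semantically weighted variant
    sketched here.
    """
    if not reference:
        raise ValueError("reference must contain at least one word")
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0

    m = len(hypothesis)
    # prev[j] = cost of aligning the first i-1 reference words
    # with the first j hypothesis words.
    prev = [float(j) for j in range(m + 1)]
    for i, ref_word in enumerate(reference, start=1):
        curr = [float(i)] + [0.0] * m
        for j, hyp_word in enumerate(hypothesis, start=1):
            curr[j] = min(
                prev[j] + 1.0,                               # deletion
                curr[j - 1] + 1.0,                           # insertion
                prev[j - 1] + sub_cost(ref_word, hyp_word),  # substitution
            )
        prev = curr
    return prev[m] / len(reference)


if __name__ == "__main__":
    ref = "the car is fast".split()
    hyp = "the automobile is fast".split()
    print(word_error_rate(ref, hyp))                 # classical WER
    print(word_error_rate(ref, hyp, semantic_cost))  # semantically weighted
```

Under these assumptions, a sentence that swaps a word for a near-synonym scores much closer to the reference than under the classical measure, which is the kind of paraphrase-style rewording a plagiarism detector needs to catch.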
