A Fast Multi-level Plagiarism Detection Method Based on Document Embedding Representation

Global networks now give easy access to vast amounts of textual information, which in turn makes plagiarism easier. Given the volume of text produced every day, the need for a fast, automated plagiarism detection system is greater than ever. Plagiarism detection is defined as the identification of reused text. Many algorithms have been proposed for detecting plagiarism in text documents, but traditional approaches are limited in semantic representation and are computationally inefficient. In this paper, we propose an embedding-based document representation that detects plagiarism using a two-level decision-making approach. The method is language-independent and can be applied to a variety of languages. In the proposed method, words are represented as multi-dimensional vectors, and simple aggregation methods combine the word vectors to represent sentences. Representations of source and suspicious sentences are then compared, and the sentence pairs with the highest similarity scores are taken as candidate plagiarism cases. The final decision on whether a pair is plagiarized is made at a second level, which computes the Jaccard similarity between the word sets of the two sentences. Our method was used in the PAN2016 Persian plagiarism detection contest, where it achieved 85.8% recall, 95.9% precision, and 90.6% plagdet (a measure that combines precision and recall with a measure of how cohesively each plagiarism case is retrieved) on the provided data sets, with a short runtime. It ranked second on plagdet and first on runtime.
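
The two-level decision described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that the "simple aggregation" is an unweighted average of word vectors, that the first-level similarity is cosine similarity, and that each level applies a fixed threshold. The function names (`sentence_vector`, `cosine`, `jaccard`, `detect`), the threshold values, and the toy two-dimensional embeddings are all hypothetical and chosen only for illustration.

```python
import math

def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of known words (assumed aggregation method)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def jaccard(a, b):
    """Jaccard similarity between the word sets of two token lists."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def detect(source_sents, suspicious_sents, embeddings, dim,
           cos_threshold=0.8, jaccard_threshold=0.3):
    """Return (source_index, suspicious_index) pairs flagged as plagiarism.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    src_vecs = [sentence_vector(s, embeddings, dim) for s in source_sents]
    cases = []
    for j, susp in enumerate(suspicious_sents):
        susp_vec = sentence_vector(susp, embeddings, dim)
        # Level 1: pick the source sentence with the highest embedding similarity.
        best_i, best_sim = -1, -1.0
        for i, src_vec in enumerate(src_vecs):
            sim = cosine(susp_vec, src_vec)
            if sim > best_sim:
                best_i, best_sim = i, sim
        if best_i < 0 or best_sim < cos_threshold:
            continue
        # Level 2: confirm the candidate pair with word-set overlap.
        if jaccard(source_sents[best_i], susp) >= jaccard_threshold:
            cases.append((best_i, j))
    return cases

# Toy data: hypothetical 2-dimensional embeddings and tokenized sentences.
toy_embeddings = {
    "cat": [1.0, 0.0], "dog": [0.9, 0.1],
    "sat": [0.0, 1.0], "ran": [0.1, 0.9],
    "car": [-1.0, 0.0],
}
source = [["cat", "sat"], ["car", "ran"]]
suspicious = [["cat", "sat", "dog"], ["dog", "ran"]]
result = detect(source, suspicious, toy_embeddings, dim=2)
print(result)  # [(0, 0)]
```

In this toy run, the second suspicious sentence is close to the first source sentence in embedding space but shares no words with it, so the Jaccard check at the second level rejects it. This shows why a second, lexical level can help filter candidates that pass the embedding comparison alone.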
