A Deep Learning Approach to Persian Plagiarism Detection

Plagiarism detection is the automatic identification of reused text. The general availability of the internet and easy access to textual information increase the need for automated plagiarism detection, and various algorithms have been proposed to detect plagiarism in text documents. Because traditional methods have drawbacks and are inefficient, and because proper algorithms for Persian plagiarism detection are lacking, in this paper we propose a deep learning based method to detect plagiarism. In the proposed method, words are represented as multi-dimensional vectors, and simple aggregation methods are used to combine the word vectors into sentence representations. The representations of source and suspicious sentences are compared, and the sentence pairs with the highest similarity are taken as candidates for plagiarism. The final decision on whether a candidate pair is plagiarism is made using a two-level evaluation method. Our method was used in the PAN2016 Persian plagiarism detection contest and achieved 90.6% plagdet, 85.8% recall, and 95.9% precision on the provided data sets.
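
The candidate-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which aggregation or similarity measure is used, so averaging and cosine similarity are assumptions here, and the toy vectors in `EMBEDDINGS` as well as the names `sentence_vector`, `cosine`, and `best_candidates` are hypothetical. The two-level evaluation is not specified in the abstract and is left out.

```python
import math

# Toy 3-dimensional word vectors with hypothetical values, for illustration only.
# A real system would load vectors trained on a large Persian corpus.
EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.9, 0.1, 0.0],
    "sat": [0.0, 1.0, 0.0],
    "ran": [0.0, 0.9, 0.1],
    "stock": [0.0, 0.0, 1.0],
    "fell": [0.1, 0.0, 0.9],
}
DIM = 3


def sentence_vector(tokens):
    """Average the vectors of known tokens (assumed aggregation method)."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def best_candidates(source_sentences, suspicious_sentences):
    """Pair each suspicious sentence with its most similar source sentence.

    Returns a list of (suspicious_index, source_index, similarity) tuples.
    """
    source_vectors = [sentence_vector(s) for s in source_sentences]
    results = []
    for i, sentence in enumerate(suspicious_sentences):
        vector = sentence_vector(sentence)
        sims = [cosine(vector, sv) for sv in source_vectors]
        j = max(range(len(sims)), key=sims.__getitem__)
        results.append((i, j, sims[j]))
    return results


source = [["cat", "sat"], ["stock", "fell"]]
suspicious = [["dog", "ran"], ["fell", "stock"]]
pairs = best_candidates(source, suspicious)
```

Because averaging ignores word order, the reordered sentence `["fell", "stock"]` gets the same representation as its source. This makes the comparison robust to simple reordering, but it also means that a later evaluation stage has to filter out pairs that are similar without being plagiarized.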
