Copy detection for digital documents

Plagiarism has become a widespread concern in recent years. As the Internet has advanced, it has become ever easier to access other people's writing. Using the content of such writing inappropriately constitutes plagiarism, which may infringe intellectual property rights and is therefore a serious problem. In this paper, we propose a new method for detecting plagiarism. We use Word2vec to obtain vector representations of the sentences in the documents, and we apply principal component analysis (PCA) to reduce the dimensionality of these representations. Using sentence-level representations allows plagiarism to be detected effectively. We present experimental results and compare our method with other methods to demonstrate its plagiarism detection capability.
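The pipeline outlined in the abstract (word vectors, sentence-level representations, PCA, comparison) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not say how word vectors are combined into a sentence vector or how sentences are compared, so averaging and cosine similarity are assumptions here. The random embedding table `EMBED` is a hypothetical stand-in for a trained Word2vec model, and the names `sentence_vector`, `pca_reduce`, and `cosine` are illustrative only.

```python
import numpy as np

DIM = 50  # assumed embedding size; the abstract does not state one

# Hypothetical stand-in for a trained Word2vec model: random vectors per word.
rng = np.random.default_rng(0)
VOCAB = ["the", "cat", "sat", "on", "mat", "a", "dog", "ran", "in", "park",
         "stocks", "fell", "sharply", "today"]
EMBED = {word: rng.normal(size=DIM) for word in VOCAB}


def sentence_vector(sentence):
    """Average the vectors of known words (an assumed composition rule)."""
    words = [w for w in sentence.lower().split() if w in EMBED]
    if not words:
        return np.zeros(DIM)
    return np.mean([EMBED[w] for w in words], axis=0)


def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)  # PCA operates on mean-centered data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


sentences = [
    "the cat sat on the mat",      # source sentence
    "the cat sat on the mat",      # verbatim copy in a suspicious document
    "stocks fell sharply today",   # unrelated sentence
    "a dog ran in the park",       # unrelated sentence
]
X = np.vstack([sentence_vector(s) for s in sentences])
reduced = pca_reduce(X, k=2)

sim_copy = cosine(reduced[0], reduced[1])   # copied pair
sim_other = cosine(reduced[0], reduced[2])  # unrelated pair
```

In this sketch a copied sentence maps to the same reduced vector as its source, so its similarity is maximal, while the unrelated pair scores lower. A real detector would need trained embeddings and a similarity threshold tuned on labeled data.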
