ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection

In this paper we describe a new general plagiarism detection method, which we used in our winning entry to the 1st International Competition on Plagiarism Detection. We entered the external plagiarism detection task, which assumes that the source documents are available. In the first phase of our method, a matrix of kernel values is computed, giving an n-gram-based similarity value between each source document and each suspicious document. In the second phase, each promising pair is investigated further in order to extract the precise positions and lengths of the subtexts that have been copied and possibly obfuscated, using encoplot, a novel linear-time pairwise sequence matching technique. We solved the significant computational challenges arising from having to compare millions of document pairs by using a library developed by our group mainly for use in network security tools. The resulting system compared more than 49 million pairs of documents in 12 hours on a single computer. The results in the challenge were very good: we outperformed all other methods.
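The first phase can be illustrated with a minimal sketch. It is not the authors' implementation or their library: the n-gram length (`n=4`), the use of character n-grams, the cosine-normalized linear kernel, and the function names `ngram_counts`, `normalized_kernel`, `kernel_matrix`, and `promising_pairs` are all assumptions chosen for illustration. The sketch only shows how an n-gram-based similarity matrix between source and suspicious documents could be computed and used to shortlist pairs.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, n=4):
    """Count overlapping character n-grams (n=4 is an illustrative choice)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def normalized_kernel(a, b):
    """Cosine-normalized linear kernel between two n-gram count vectors.

    This particular kernel is an assumption; the text above only says that
    the similarity is a kernel value based on n-grams.
    """
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def kernel_matrix(sources, suspicious, n=4):
    """Rows are suspicious documents, columns are source documents."""
    src = [ngram_counts(s, n) for s in sources]
    sus = [ngram_counts(s, n) for s in suspicious]
    return [[normalized_kernel(x, y) for x in src] for y in sus]


def promising_pairs(matrix, top=1):
    """For each suspicious document, keep the `top` most similar sources.

    Returns (suspicious_index, source_index) pairs; the ranking rule is a
    hypothetical selection criterion.
    """
    pairs = []
    for j, row in enumerate(matrix):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        pairs.extend((j, i) for i in ranked[:top])
    return pairs
```

Calling `promising_pairs(kernel_matrix(sources, suspicious))` yields the index pairs that a second, more detailed matching phase would then examine. This naive version builds every pair explicitly and is meant only to show the idea, not to reach the throughput reported above.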
