Reducing computational effort for plagiarism detection by using citation characteristics to limit retrieval space

This paper proposes a hybrid approach to plagiarism detection in academic documents that integrates detection methods using citations, semantic argument structure, and semantic word similarity with character-based methods to achieve higher detection performance for disguised forms of plagiarism. Currently available plagiarism detection software exclusively performs text string comparisons. These systems find copies but fail to identify disguised plagiarism, such as paraphrases, translations, or idea plagiarism. Detection approaches that consider semantic similarity at the word and sentence level exist and have consistently achieved higher detection accuracy for disguised plagiarism than character-based approaches. However, the high computational effort of these semantic approaches makes them infeasible in real-world plagiarism detection scenarios. The proposed hybrid approach uses citation-based methods as a preliminary heuristic to reduce the retrieval space with a relatively low loss in detection accuracy. This preliminary step can then be followed by a computationally more expensive semantic and character-based analysis. We show that such a hybrid approach makes semantic plagiarism detection feasible on large collections for the first time.
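The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Doc` type, the `min_shared` threshold, and the use of shared-reference counting as the citation-based filter are assumptions made for illustration, and a simple character n-gram overlap stands in for the more expensive semantic and character-based analysis.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Doc:
    """Hypothetical document record: full text plus the set of cited reference IDs."""
    text: str
    refs: FrozenSet[str]


def char_ngram_similarity(text_a: str, text_b: str, n: int = 3) -> float:
    """Stand-in for the expensive analysis: Jaccard overlap of character n-grams."""
    grams_a = {text_a[i:i + n] for i in range(len(text_a) - n + 1)}
    grams_b = {text_b[i:i + n] for i in range(len(text_b) - n + 1)}
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def hybrid_detect(
    query: Doc,
    collection: Dict[str, Doc],
    min_shared: int = 2,  # assumed threshold, not taken from the paper
    expensive: Callable[[str, str], float] = char_ngram_similarity,
) -> List[Tuple[str, float]]:
    """Filter by shared citations first, then score only the surviving candidates."""
    # Stage 1: cheap citation-based filter (a set intersection per document).
    candidates = [
        doc_id
        for doc_id, doc in collection.items()
        if len(query.refs & doc.refs) >= min_shared
    ]
    # Stage 2: the costly comparison runs on the reduced retrieval space only.
    scored = [(doc_id, expensive(query.text, collection[doc_id].text)) for doc_id in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch, a document that shares fewer than `min_shared` references with the query is never passed to the expensive comparison, even if its text is identical. That is the kind of accuracy loss the filtering step trades for reduced computational effort.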
