Citation pattern matching algorithms for citation-based plagiarism detection: greedy citation tiling, citation chunking and longest common citation sequence

Plagiarism detection systems have been developed to locate instances of plagiarism, e.g., within scientific papers. Studies have shown that existing approaches deliver reasonable results in identifying copy-and-paste plagiarism but fail to detect more sophisticated forms such as paraphrased plagiarism, translation plagiarism, or idea plagiarism. The authors of this paper demonstrated in recent studies that the detection rate can be significantly improved by relying not only on text analysis but also on an analysis of a document's citations. Citations are valuable language-independent markers that act much like a fingerprint. In fact, our examinations of real-world cases have shown that the order of citations in a document often remains similar even if the text has been strongly paraphrased or translated to disguise plagiarism. This paper introduces three algorithms and discusses their suitability for citation-based plagiarism detection. Because plagiarism can occur in numerous ways, these algorithms need to be versatile: they must be capable of detecting transpositions, scaling, and combinations of both, in local and global form. The algorithms are named Greedy Citation Tiling, Citation Chunking, and Longest Common Citation Sequence. The evaluation showed that if these algorithms are combined, common forms of plagiarism can be detected reliably.
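The abstract does not spell out the algorithms, so the following is only a minimal sketch of the underlying idea, under stated assumptions: each document is reduced to the ordered list of the sources it cites; the longest common citation sequence is assumed to be the classic longest common subsequence over those lists; and the tiling function is a simplified variant of greedy string tiling applied to citations. The function names, the `min_len` parameter, and the sample data are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def longest_common_citation_sequence(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Classic LCS over citation identifiers (assumed reading of LCCS).

    Matched citations must appear in the same order in both documents,
    but non-matching citations may lie in between.
    """
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack to recover one longest sequence.
    out: List[str] = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def greedy_citation_tiles(
    a: Sequence[str], b: Sequence[str], min_len: int = 2
) -> List[Tuple[int, int, int]]:
    """Simplified greedy tiling (hypothetical variant, not the paper's exact method).

    Repeatedly takes the longest run of identical, not yet matched citations
    and marks it as a tile (start_in_a, start_in_b, length). Because tiles are
    found independently of position, transposed blocks are still detected.
    """
    used_a = [False] * len(a)
    used_b = [False] * len(b)
    tiles: List[Tuple[int, int, int]] = []
    while True:
        best = (0, -1, -1)  # (length, start_in_a, start_in_b)
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (
                    i + k < len(a)
                    and j + k < len(b)
                    and a[i + k] == b[j + k]
                    and not used_a[i + k]
                    and not used_b[j + k]
                ):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        length, i, j = best
        if length < min_len:
            break
        for k in range(length):
            used_a[i + k] = True
            used_b[j + k] = True
        tiles.append((i, j, length))
    return tiles


# Hypothetical citation sequences: the suspect document swaps two blocks
# of citations and inserts one unrelated source "X".
original = ["A", "B", "C", "D", "E", "F"]
suspect = ["D", "E", "F", "X", "A", "B", "C"]

lccs = longest_common_citation_sequence(original, suspect)
tiles = greedy_citation_tiles(original, suspect)
```

In this made-up example the two citation blocks are transposed. The order-preserving subsequence can cover only one of the blocks, while the tiling sketch reports both as separate tiles of length 3. This illustrates why the abstract argues for combining several algorithms rather than relying on one.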
