Citation‐based plagiarism detection: Practicability on a large‐scale scientific corpus

The automated detection of plagiarism is an information retrieval task of increasing importance as the volume of readily accessible information on the web expands. A major shortcoming of current automated plagiarism detection approaches is their dependence on high character‐based similarity. As a result, heavily disguised plagiarism forms, such as paraphrases, translated plagiarism, or structural and idea plagiarism, remain undetected. A recently proposed language‐independent approach to plagiarism detection, Citation‐based Plagiarism Detection (CbPD), allows the detection of semantic similarity even in the absence of text overlap by analyzing the citation placement in a document's full text to determine similarity. This article evaluates the performance of CbPD in detecting plagiarism with various degrees of disguise in a collection of 185,000 biomedical articles. We benchmark CbPD against two character‐based detection approaches using a ground truth approximated in a user study. Our evaluation shows that the citation‐based approach achieves superior ranking performance for heavily disguised plagiarism forms. Additionally, we demonstrate CbPD to be computationally more efficient than character‐based approaches. Finally, upon combining the citation‐based with the traditional character‐based document similarity visualization methods in a hybrid detection prototype, we observe a reduction in the required user effort for document verification.
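The core idea of comparing citation placement rather than text can be illustrated with a minimal sketch. It assumes each document has already been reduced to the ordered list of sources it cites, and it scores a pair by the longest common subsequence of those lists. The function names, the normalization by the shorter list, and the sample identifiers are hypothetical choices made for illustration; they are not necessarily the measures evaluated in the article.

```python
from typing import Sequence


def longest_common_citation_sequence(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two citation sequences.

    Citations must appear in the same order in both documents, but other
    citations may be interleaved between the matches.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def citation_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Illustrative score in [0, 1]: shared ordered citations over the shorter list."""
    if not a or not b:
        return 0.0
    return longest_common_citation_sequence(a, b) / min(len(a), len(b))


# Hypothetical citation sequences; the identifiers are made up for the example.
doc_a = ["smith2001", "lee2005", "kim2008", "chen2010"]
doc_b = ["smith2001", "garcia2003", "lee2005", "chen2010"]
doc_c = ["wong1999", "patel2004"]

print(citation_similarity(doc_a, doc_b))  # shares three citations in the same order
print(citation_similarity(doc_a, doc_c))  # shares no citations
```

Because the comparison uses only source identifiers and their order, the score is unaffected by paraphrasing or translation of the surrounding text, which is the property the abstract attributes to the citation-based approach.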
