Patterns of text reuse in a scientific corpus

Significance: In the modern electronic format it is both easier to reuse text and easier to detect reused text. This is the first comprehensive study of patterns of text reuse within the full texts of an important large scientific corpus, covering a 20-year timeframe. It provides an important baseline for what is regarded as standard practice within the affected research communities, a standard somewhat more lenient than that currently applied to journalists, popular authors, and public figures.

We consider the incidence of text “reuse” by researchers via a systematic pairwise comparison of the text content of all articles deposited to arXiv.org from 1991 to 2012. We measure the global frequencies of three classes of text reuse and how chronic text reuse is distributed among authors in the dataset. We infer a baseline for accepted practice, perhaps surprisingly permissive compared with other societal contexts, and a clearly delineated set of aberrant authors. We find a negative correlation between the amount of reused text in an article and its influence, as measured by subsequent citations. Finally, we consider the distribution of countries of origin of articles containing large amounts of reused text.
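The abstract does not say how the pairwise comparison is carried out, so the following is only a minimal sketch of one common approach: counting word n-grams ("shingles") that two documents share. The function names `shingles` and `shared_shingles`, the n-gram length, and the toy documents are all assumptions made for illustration, not details taken from the study.

```python
from itertools import combinations


def shingles(text, n=7):
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_shingles(docs, n=7):
    """Count the n-grams shared by every pair of documents.

    `docs` maps a document id to its text. Only pairs with at least one
    shared n-gram appear in the result. This is a brute-force comparison
    of all pairs (an assumption of this sketch, not the study's method).
    """
    sets = {doc_id: shingles(text, n) for doc_id, text in docs.items()}
    overlap = {}
    for a, b in combinations(sorted(sets), 2):
        common = len(sets[a] & sets[b])
        if common:
            overlap[(a, b)] = common
    return overlap


# Toy documents, invented for this example.
docs = {
    "A": "we measure the global frequencies of three classes of text reuse",
    "B": "here we measure the global frequencies of three classes of events",
    "C": "an unrelated sentence about something else entirely with no overlap here",
}
result = shared_shingles(docs, n=5)
print(result)
```

Comparing every pair directly grows quadratically with the number of documents, so a corpus of arXiv's size would in practice need an index or a fingerprinting scheme rather than this loop.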
