We present the Webis-STEREO-21 dataset, a massive collection of Scientific Text Reuse in Open-access publications. It contains more than 91 million cases of reused text passages found in 4.2 million unique open-access publications. Featuring a high coverage of scientific disciplines and varieties of reuse, as well as comprehensive metadata to contextualize each case, our dataset addresses the most salient shortcomings of previous datasets on scientific writing. Webis-STEREO-21 allows for tackling a wide range of research questions from different scientific backgrounds, facilitating both qualitative and quantitative analysis of the phenomenon as well as a first-time grounding of the base rate of text reuse in scientific publications.
Background & Summary
The reuse of text has a longstanding history in science. In qualitative research, besides verbatim quotations, the techniques of paraphrasing, translation, and summarization are instrumental to both teaching and learning scientific writing as well as to gaining new scientific insights [1]. In quantitative research, the use of templates as an efficient way of reporting new results on otherwise standardized workflows is common [2]. As science often progresses incrementally, authors may also reuse their texts across (different types of) subsequent publications on the same subject (also called “text recycling”) [2–4]. Likewise, in interdisciplinary research, reuse across publications at venues of different disciplines has been observed to promote the dissemination of new insights [5,6]. Orthogonal to all of these manifestations of text reuse is the scientific context that stipulates its legitimacy: Plagiarism, the intentional reuse of text without acknowledgment of its original source, violates codes of honor and academic integrity [7]. Text reuse has been quantitatively studied in many scientific disciplines [1,2,8–10]; yet few studies assess the phenomenon at scale, beyond what can be manually analyzed [9,10].
Large-scale studies require the use of automatic text reuse detection technology. Because such detection is both algorithmically challenging and computationally expensive, a lack of expertise or budget may have prevented such studies. Employing proprietary analysis software or services instead is likewise subject to budgetary limitations, and such tools additionally lack methodological transparency and reproducibility. Text reuse detection itself is still subject to ongoing research in natural language processing and information retrieval. Setting up a custom processing pipeline thus demands an evaluation against the state of the art. The challenges in constructing a competitive solution for this task arise from the aforementioned diversity of forms of text reuse, the large solution space of detection approaches, and the need to apply heuristics that render a given solution sufficiently scalable. Preprocessing a collection of scientific publications presents its own difficulties as well. These include the noisy and error-prone conversion of a publication’s original PDF version to machine-readable text and the collection of reliable metadata about the publications. The available quantitative studies on scientific text reuse fall short in documenting the preprocessing steps taken, the design choices of their solution to text reuse detection, and the justification of these choices in terms of rigorous evaluation. Altogether, comparable, reproducible, reliable, and accessible research on the phenomenon of scientific text reuse remains an open problem. To provide a solid new foundation for the investigation of scientific text reuse within and across disciplines, we compile Webis-STEREO-21.
To overcome the aforementioned issues, we stipulate three design principles for the creation of the dataset: (1) high coverage, both in terms of the number of included publications and the variety of scientific disciplines; (2) a scalable approach to reuse detection with a focus on high precision at a competitive recall, capturing a comprehensive set of reused passages as a reliable resource for research on scientific text reuse; and (3) comprehensive metadata to contextualize each case, to address a wide range of potential hypotheses, and to provide the basis for semantic post-processing, for instance, to separate benign text reuse from plagiarism.
arXiv:2112.11800v1 [cs.DL] 22 Dec 2021
Webis-STEREO-21 results from applying scalable text reuse detection approaches to a large collection of scientific open-access publications, exhaustively comparing all documents to extract a comprehensive dataset of reused passages between them. It contains more than 91 million cases of reused passages among 4.2 million unique publications. The cases stem from 46 scientific fields of study, grouped into 14 scientific areas in all four major scientific disciplines, and span over 150 years of scientific publishing between 1860 and 2018. The data is openly accessible to be useful to a wide range of researchers with different scientific backgrounds, enabling both qualitative and quantitative analysis.
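The idea of exhaustively comparing all documents for reused passages can be illustrated with a toy sketch. It is not the detection pipeline used to build Webis-STEREO-21: it assumes word 3-gram shingles and a containment score in the spirit of Broder's resemblance and containment measures, and the function names, the shingle length, and the 0.5 threshold are hypothetical choices made for illustration only.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of lowercased word n-grams (shingles) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def containment(a, b):
    """Fraction of the shingles in `a` that also occur in `b` (0.0 if `a` is empty)."""
    return len(a & b) / len(a) if a else 0.0


def candidate_pairs(docs, n=3, threshold=0.5):
    """Compare every pair of documents and yield (id_x, id_y, score) for pairs
    whose containment, in either direction, reaches the threshold.

    `docs` maps a document id to its plain text. The threshold is an
    illustrative assumption, not a value taken from the paper.
    """
    sets = {doc_id: shingles(text, n) for doc_id, text in docs.items()}
    for x, y in combinations(sorted(sets), 2):
        score = max(containment(sets[x], sets[y]),
                    containment(sets[y], sets[x]))
        if score >= threshold:
            yield x, y, score


if __name__ == "__main__":
    docs = {
        "a": "the samples were analyzed using the standard protocol described earlier",
        "b": "all the samples were analyzed using the standard protocol in our lab",
        "c": "text reuse has a long history in science",
    }
    for x, y, score in candidate_pairs(docs):
        print(x, y, round(score, 2))
```

The nested comparison grows quadratically with the number of documents, which is why a collection of millions of publications calls for the scalability heuristics mentioned above rather than a direct all-pairs loop like this one.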
[1] Jean-Gabriel Ganascia, et al. Automatic detection of reuses and citations in literary texts. Lit. Linguistic Comput., 2014.
[2] Yang Song, et al. An Overview of Microsoft Academic Service (MAS) and Applications. WWW, 2015.
[3] Fang-Ying Yang, et al. Uncovering published authors' text-borrowing practices: Paraphrasing strategies, sources, and self-plagiarism. 2015.
[4] Jie Tang, et al. ArnetMiner: extraction and mining of academic social networks. KDD, 2008.
[5] Cary Moskovitz. Self-Plagiarism, Text Recycling, and Science Education. 2016.
[6] S. Horbach, et al. The extent and causes of academic text recycling or ‘self-plagiarism’. Research Policy, 2017.
[7] Andrei Z. Broder, et al. On the resemblance and containment of documents. Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), 1997.
[8] Benno Stein, et al. Strategies for retrieving plagiarized documents. SIGIR, 2007.
[9] Laurent Romary, et al. GROBID - Information Extraction from Scientific Publications. ERCIM News, 2015.
[10] Efstathios Stamatatos, et al. Plagiarism detection using stopword n-grams. J. Assoc. Inf. Sci. Technol., 2011.
[11] M. Eberle. Paraphrasing, Plagiarism, and Misrepresentation in Scientific Writing. 2013.
[12] Matthias Hagen, et al. Source Retrieval for Web-Scale Text Reuse Detection. CIKM, 2017.
[13] Matthias Hagen, et al. Wikipedia Text Reuse: Within and Without. ECIR, 2018.
[14] Matthias Hagen, et al. Overview of the 1st international competition on plagiarism detection. 2009.
[15] Stephanie J Bird, et al. Self-plagiarism and dual and redundant publications: What is the problem? Science and engineering ethics, 2002.
[16] Paul Ginsparg, et al. Patterns of text reuse in a scientific corpus. Proceedings of the National Academy of Sciences, 2014.
[17] Matthias Hagen, et al. Source Retrieval for Plagiarism Detection from Large Web Corpora: Recent Approaches. CLEF, 2015.
[18] Zdenek Zdráhal, et al. CORE: Three Access Levels to Underpin Open Access. D Lib Mag., 2012.
[19] Q. Wen, et al. Dual publication and academic inequality. 2007.
[20] Ian G. Anson, et al. Text recycling in STEM: A text-analytic study of recently published research articles. Accountability in research, 2020.
[21] Cary Moskovitz, et al. Attitudes toward text recycling in academic writing across disciplines. Accountability in research, 2018.