The citation advantage of linking publications to research data

Efforts to make research results open and reproducible are increasingly reflected in journal policies that encourage or mandate authors to provide data availability statements (DAS). As a consequence, recent literature shows a strong uptake of data availability statements. Nevertheless, it is still unclear what proportion of these statements actually contain well-formed links to data, for example via a URL or permanent identifier, and whether providing such links adds value. We consider 531,889 journal articles published by PLOS and BMC, develop an automatic system for labelling their data availability statements according to four categories based on their content and the type of data availability they display, and finally analyze the citation advantage of different statement categories via regression. We find that, following mandated publisher policies, data availability statements have become very common. In 2018, 93.7% of 21,793 PLOS articles and 88.2% of 31,956 BMC articles had data availability statements. Statements containing a link to data in a repository, as opposed to statements that data are available on request or included as supporting information files, are a fraction of the total. In 2017 and 2018, 20.8% of PLOS publications and 12.2% of BMC publications provided a DAS containing a link to data in a repository. Using a citation prediction model, we also find that articles whose statements link to data in a repository are associated with up to 25.36% (± 1.07%) higher citation impact on average. We discuss the potential implications of these results for authors (researchers) and journal publishers who make the effort to share data in repositories. All our data and code are made available so that our results can be reproduced and extended.
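To make the labelling task concrete, the following is a minimal sketch of a rule-based heuristic that sorts a data availability statement by its wording. It is not the authors' system, which the abstract does not describe in detail. The function name `label_statement`, the regular expressions, and the category names are all assumptions chosen for illustration, and they do not reproduce the paper's four categories.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+|\bwww\.\S+", re.IGNORECASE)
DOI_RE = re.compile(r"\b10\.\d{4,9}/\S+")
REQUEST_RE = re.compile(r"\b(up)?on (reasonable )?request\b", re.IGNORECASE)
SUPPLEMENT_RE = re.compile(
    r"\bsupporting information\b|\bsupplementary\b"
    r"|\bwithin the (paper|article|manuscript)\b",
    re.IGNORECASE,
)


def label_statement(text):
    """Return an illustrative label for a data availability statement.

    The labels are assumptions made for this sketch, not the paper's
    category scheme. Rules are checked in order, so a statement that
    contains a link is labelled as such even if it also mentions
    supporting files or requests.
    """
    if not text or not text.strip():
        return "no_statement"
    if URL_RE.search(text) or DOI_RE.search(text):
        # A URL or DOI is present; whether it points to a data
        # repository would need a further check.
        return "contains_link"
    if SUPPLEMENT_RE.search(text):
        return "in_paper_or_supplement"
    if REQUEST_RE.search(text):
        return "on_request"
    return "other"


if __name__ == "__main__":
    examples = [
        "Data are available at https://example.org/dataset.",
        "Data are available from the corresponding author on reasonable request.",
        "All relevant data are within the paper and its Supporting Information files.",
    ]
    for statement in examples:
        print(label_statement(statement), "|", statement)
```

A heuristic like this only detects that a link is present; deciding whether the link is well-formed and resolves to a repository, as the abstract asks, would require additional validation.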
