Who Shares? Who Doesn't? Factors Associated with Openly Archiving Raw Research Data

Many initiatives encourage investigators to share their raw datasets in hopes of increasing research efficiency and quality. Despite these investments of time and money, we do not have a firm grasp of who openly shares raw research data, who doesn't, and which initiatives are correlated with high rates of data sharing. In this analysis I use bibliometric methods to identify patterns in the frequency with which investigators openly archive their raw gene expression microarray datasets after study publication. Automated methods identified 11,603 articles published between 2000 and 2009 that describe the creation of gene expression microarray data. Associated datasets in best-practice repositories were found for 25% of these articles, increasing from less than 5% in 2001 to 30%–35% in 2007–2009. Accounting for sensitivity of the automated methods, approximately 45% of recent gene expression studies made their data publicly available. First-order factor analysis on 124 diverse bibliometric attributes of the data creation articles revealed 15 factors describing authorship, funding, institution, publication, and domain environments. In multivariate regression, authors were most likely to share data if they had prior experience sharing or reusing data, if their study was published in an open access journal or a journal with a relatively strong data sharing policy, or if the study was funded by a large number of NIH grants. Authors of studies on cancer and human subjects were least likely to make their datasets available. These results suggest research data sharing levels are still low and increasing only slowly, and data is least available in areas where it could make the biggest impact. Let's learn from those with high rates of sharing to embrace the full potential of our research output.
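The adjustment from an observed sharing rate of 30%–35% to an estimated 45% can be illustrated with a short sketch. It assumes the simplest possible correction: the automated search misses some shared datasets but never reports one that does not exist, so the observed rate is divided by the search sensitivity. The function name and the sensitivity value of 0.75 are hypothetical and chosen only for illustration; the abstract does not state the sensitivity actually used or the exact correction method.

```python
def corrected_sharing_rate(observed_rate: float, sensitivity: float) -> float:
    """Estimate the true sharing rate from an observed rate.

    Assumes the detection method has no false positives, so
    observed_rate = true_rate * sensitivity.
    """
    if not 0.0 <= observed_rate <= 1.0:
        raise ValueError("observed_rate must be between 0 and 1")
    if not 0.0 < sensitivity <= 1.0:
        raise ValueError("sensitivity must be in the interval (0, 1]")
    # Cap at 1.0 because a proportion cannot exceed 100%.
    return min(observed_rate / sensitivity, 1.0)


if __name__ == "__main__":
    # Hypothetical sensitivity, NOT a figure taken from the study.
    assumed_sensitivity = 0.75
    for observed in (0.30, 0.35):
        estimate = corrected_sharing_rate(observed, assumed_sensitivity)
        print(f"observed {observed:.2f} -> estimated {estimate:.2f}")
```

Under this illustrative sensitivity, the observed range maps to estimates on either side of the 45% figure reported above; a different sensitivity would shift the estimate accordingly.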
