The Virtuous Cycle of a Data Ecosystem

Digital data of all types are being created at an ever-increasing rate, doubling approximately every two years. Annual data creation is estimated to reach 44 trillion gigabytes by 2020 [1]. Similarly, the rate at which primary scientific data are being collected is accelerating [2]. This astounding growth in scientific data creation has led to the contemporary discussion of scientific data sharing policies. Many of the criticisms levied against data sharing have focused on practical issues such as the economics and logistics of data storage, the technical challenges of sharing, or appropriate attribution of credit [2–9]. In contrast, the arguments in favor of data sharing have focused largely on scientific replication, reproducibility [10], facilitation of collaborative research, and increased citations for publications that share data [11]. This is largely an ethical argument: there is an obligation to share data collected using public funds [3–6,12,13]. Rather than focusing on the much-discussed arguments against data sharing (cost, infrastructure, curation, privacy, and attribution/credit concerns), in this Perspective, I outline the overlooked benefits of data sharing: novel remixing and combining as well as bias minimization and meta-analysis. I argue that we must weigh the costs against the true value of the possible benefits. If the decision for any individual researcher, university, or funding agency to implement data sharing policies comes down to a cost-benefit analysis based solely on replication versus storage, that analysis may be artificially tipped in favor of not sharing data because more subtle, but critical, benefits are overlooked. These hidden benefits of data remixing cannot be appreciated when each individual dataset is considered as an independent entity, and thus a richer consideration of those benefits is warranted.
Although there is some evidence that, on the local scale, research groups may not make use of shared data [14], in this Perspective, I outline the ways in which research groups are beginning to take advantage of open data in novel, and sometimes surprising, ways. Rather than arguing for a centralized, large-scale data repository, I advocate a more organic development in which we encourage, at the institutional level, the growth of a data ecosystem. This can be done via multiple venues, such as the general scientific data sharing sites figshare (https://figshare.com/) and the Dryad Digital Repository (http://datadryad.org/). Both of these, like Nature Publishing Group’s recently launched peer-reviewed data sharing journal, Scientific Data [15], provide citable Digital Object Identifiers for the data themselves. Such developments address concerns regarding credit and help motivate data curation and contextualization. A data sharing ecosystem provides space for multiple diverse datasets to intermingle, encouraging new, multidisciplinary discoveries by current and future scientists.

[1] Roger Stone, et al. The International Atmospheric Circulation Reconstructions over the Earth (ACRE) Initiative, 2011.

[2] Bradley Voytek, et al. Automated cognome construction and semi-automated hypothesis generation, 2012, Journal of Neuroscience Methods.

[3] Michael W. Carroll. Sharing Research Data and Intellectual Property Law: A Primer, 2015, PLoS biology.

[4] Jared Lyle, et al. The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data, 2010, iPRES.

[5] M. Burke, et al. Quantifying the Influence of Climate on Human Conflict, 2013, Science.

[6] Richard A. Gibbs, et al. No Longer De-Identified, 2006, Science.

[7] Martine Peeters, et al. Hybrid Origin of SIV in Chimpanzees, 2003, Science.

[8] Brian A. Nosek, et al. Promoting an open research culture, 2015, Science.

[9] Allan R. Jones, et al. An anatomically comprehensive atlas of the adult human brain transcriptome, 2012, Nature.

[10] Aniket Kittur, et al. The Cognitive Atlas: Toward a Knowledge Foundation for Cognitive Neuroscience, 2011, Front. Neuroinform.

[11] S. Levinson, et al. WEIRD languages have misled us, too, 2010, Behavioral and Brain Sciences.

[12] J. Ioannidis. Why Most Published Research Findings Are False, 2019, CHANCE.

[13] A. Vickers, et al. Do certain countries produce only positive results? A systematic review of controlled trials, 1998, Controlled clinical trials.

[14] D. V. van Essen, et al. Challenges and Opportunities in Mining Neuroscience Data, 2011, Science.

[15] L. Hood, et al. Leroy Hood expounds the principles, practice and future of systems biology, 2003, Drug discovery today.

[16] John H. Porter, et al. The Ethics of Data Sharing and Reuse in Biology, 2013.

[17] N. Hawkins, et al. Data sharing in genomics — re-shaping scientific practice, 2009, Nature Reviews Genetics.

[18] Adam R Ferguson, et al. Big data from small data: data-sharing in the 'long tail' of neuroscience, 2014, Nature Neuroscience.

[19] C. Tenopir, et al. Data Sharing by Scientists: Practices and Perceptions, 2011, PLoS ONE.

[20] Georgina M. Montgomery, et al. It's Good to Share: Why Environmental Scientists’ Ethics Are Out of Date, 2014, Bioscience.

[21] Amy L McGuire, et al. Genetics. No longer de-identified, 2006, Science.

[22] C. B. Colby. The weirdest people in the world, 1973.

[23] T. Yarkoni. Psychoinformatics: New Horizons at the Interface of the Psychological and Computing Sciences, 2012.

[24] F. Berman, et al. Who Will Pay for Public Access to Research Data?, 2013, Science.

[25] O. Bertrand, et al. Oscillatory activity of the human cerebellum: The intracranial electrocerebellogram revisited, 2013, Neuroscience & Biobehavioral Reviews.

[26] Erez Lieberman, et al. Quantifying the evolutionary dynamics of language, 2007, Nature.

[27] Florence Debarre, et al. The Availability of Research Data Declines Rapidly with Article Age, 2013, Current Biology.

[28] Janet Currie, et al. “Big Data” Versus “Big Brother”: On the Appropriate Use of Large-scale Data Collections in Pediatrics, 2013, Pediatrics.

[29] Stephen H Koslow, et al. Sharing primary data: a threat or asset to discovery?, 2002, Nature Reviews Neuroscience.

[30] R. Soummer, et al. ORBITAL MOTION OF HR 8799 b, c, d USING HUBBLE SPACE TELESCOPE DATA FROM 1998: CONSTRAINTS ON INCLINATION, ECCENTRICITY, AND STABILITY, 2011, 1110.1382.

[31] R. C. Gerkin, et al. Brain-wide analysis of electrophysiological diversity yields novel categorization of mammalian neuron types, 2015, Journal of neurophysiology.

[32] Björn-Olav Dozo, et al. Quantitative Analysis of Culture Using Millions of Digitized Books, 2010.

[33] C. Lintott, et al. Galaxy Zoo: the large-scale spin statistics of spiral galaxies in the Sloan Digital Sky Survey, 2008, 0803.3247.

[34] M. Massagli, et al. Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm, 2011, Nature Biotechnology.

[35] Z. Popovic, et al. Increased Diels-Alderase activity through backbone remodeling guided by Foldit players, 2012, Nature Biotechnology.

[36] C. Borgman, et al. If We Share Data, Will Anyone Use Them? Data Sharing and Reuse in the Long Tail of Science and Technology, 2013, PLoS ONE.

[37] Russell A. Poldrack, et al. Large-scale automated synthesis of human functional neuroimaging data, 2011, Nature Methods.

[38] R. MacCoun, et al. Biases in the interpretation and use of research results, 1998, Annual review of psychology.

[39] Hein Putter, et al. Persistent epigenetic differences associated with prenatal exposure to famine in humans, 2008, Proceedings of the National Academy of Sciences.

[40] J. Ioannidis. Why Most Published Research Findings Are False, 2005, PLoS medicine.

[41] J. P. Hamaker, et al. Image sharpness, Fourier optics, and redundant-spacing interferometry, 1977.

[42] Shawn D. Burton, et al. NeuroElectro: a window to the world's neuron electrophysiology data, 2014, Front. Neuroinform.

[43] D. Boyd, et al. CRITICAL QUESTIONS FOR BIG DATA, 2012.

[44] D. V. Essen, et al. Cognitive neuroscience 2.0: building a cumulative science of human brain function, 2010, Trends in Cognitive Sciences.

[45] Peter Norvig, et al. The Unreasonable Effectiveness of Data, 2009, IEEE Intelligent Systems.

[46] Russell A. Poldrack, et al. Discovering Relations Between Mind, Brain, and Mental Disorders Using Topic Mapping, 2012, PLoS Comput. Biol.

[47] Albert-László Barabási, et al. Flavor network and the principles of food pairing, 2011, Scientific reports.

[48] Theodor D. Sterling, et al. Sharing scientific data, 1990, CACM.

[49] I. Zucker, et al. Sex bias in neuroscience and biomedical research, 2011, Neuroscience & Biobehavioral Reviews.

[50] Scott T. Grafton, et al. Sharing neuroimaging studies of human cognition, 2004, Nature Neuroscience.

[51] Srinivas C. Turaga, et al. Space-time wiring specificity supports direction selectivity in the retina, 2014, Nature.

[52] Krzysztof J. Gorgolewski, et al. Bridging psychology and genetics using large-scale spatial analysis of neuroimaging and neurogenetic data, 2014, bioRxiv.

[53] Lief E. Fenno, et al. The Microbial Opsin Family of Optogenetic Tools, 2011, Cell.

[54] Alan C. Evans, et al. OMEGA: The Open MEG Archive, 2016, NeuroImage.

[55] Anthony Landreth, et al. The Need for Research Maps to Navigate Published Work and Inform Experiment Planning, 2013, Neuron.