If these data could talk

In the last few decades, data-driven methods have come to dominate many fields of scientific inquiry. Open data and open-source software have enabled the rapid implementation of novel methods to manage and analyze the growing flood of data. However, it has become apparent that many scientific fields exhibit distressingly low rates of reproducibility. Although there are many dimensions to this issue, we believe that one contributor is a lack of formalism in describing end-to-end published results, from the data source through the analysis to the final published results. Even when authors do their best to make their research and data accessible, this lack of formalism reduces the clarity and efficiency of reporting, which contributes to issues of reproducibility. Data provenance aids reproducibility through systematic and formal records of the relationships among data sources, processes, datasets, publications, and researchers.
