Provenance and the Different Flavors of Reproducibility

While reproducibility has been a requirement in the natural sciences for centuries, computational experiments have not been held to the same standard. Publications often provide insufficient information to reproduce the computational results they describe, and in the recent past this has led to many retractions. Although scientists are aware of the numerous benefits of reproducibility, the perceived amount of work required to make results reproducible is a significant disincentive. Fortunately, much of the information needed to reproduce an experiment can be obtained by systematically capturing its provenance. In this paper, we give an overview of different types of provenance and how they can be used to support reproducibility. We also describe a representative set of provenance tools and approaches that make it easy to create reproducible experiments.
