Four-level provenance support to achieve portable reproducibility of scientific workflows

One of the most pressing challenges in the scientific community is the reproducibility of workflow executions. To reproduce the results of an experiment, provenance information must be collected and, at the same time, the dependencies of the execution on its original environment need to be eliminated. With respect to the workflow execution environment, we distinguish four levels of provenance: infrastructural, environmental, workflow, and data provenance. During re-execution, components at any of these levels may change, and capturing data at each level addresses a different problem. For example, storing environmental and infrastructural parameters enables the portability of workflows between different parallel and distributed systems (grid, HPC, cloud), while the descriptors of the workflow model make it possible to track different versions of the workflow and their impact on the execution. Our goal is to capture an optimal set of parameters, in both number and type, and to reconstruct the way the data were produced independently of the environment. In this paper we investigate the necessary and sufficient parameters of workflow reproducibility and give a mathematical formula to determine the rate of reproducibility. These measurements allow scientists to decide on the next steps toward the creation of reproducible workflows.
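The abstract refers to a formula for the rate of reproducibility without stating it. As a minimal illustrative sketch only, the Python snippet below shows one plausible way such a rate could be aggregated over the four provenance levels, namely as the fraction of required provenance parameters that were actually captured. The level names are taken from the abstract, but the data structure, the ratio, and all numbers are assumptions for demonstration, not the paper's definition.

```python
# Illustrative sketch: the paper's actual formula is not given in the abstract,
# so this simple captured/required ratio is an assumption made for demonstration.
from dataclasses import dataclass

@dataclass
class ProvenanceLevel:
    name: str        # "infrastructural", "environmental", "workflow", or "data"
    captured: int    # number of provenance parameters actually recorded
    required: int    # number of parameters needed to re-execute at this level

def reproducibility_rate(levels):
    """Fraction of required provenance parameters that were captured,
    aggregated over all levels (a hypothetical measure)."""
    captured = sum(l.captured for l in levels)
    required = sum(l.required for l in levels)
    return captured / required if required else 1.0

levels = [
    ProvenanceLevel("infrastructural", captured=8,  required=10),
    ProvenanceLevel("environmental",   captured=5,  required=6),
    ProvenanceLevel("workflow",        captured=12, required=12),
    ProvenanceLevel("data",            captured=3,  required=5),
]
print(f"reproducibility rate: {reproducibility_rate(levels):.2f}")  # 0.85
```

A per-level breakdown of the same ratio would indicate which level (for instance, data provenance in the example above) most limits reproducibility and therefore where capture should be improved first.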
