A Semantic-Based Approach to Attain Reproducibility of Computational Environments in Scientific Workflows: A Case Study

Reproducible research in scientific workflows is often addressed by tracking the provenance of the produced results. While this approach allows inspecting intermediate and final results, improves understanding, and permits replaying a workflow execution, it does not ensure that the computational environment is available for subsequent executions to reproduce the experiment. In this work, we propose describing the resources involved in the execution of an experiment using a set of semantic vocabularies, so as to conserve the computational environment. We define a process for documenting the workflow application, management system, and their dependencies based on four domain ontologies. We then conduct an experimental evaluation using a real workflow application on an academic and a public Cloud platform. Results show that our approach can reproduce an equivalent execution environment of a predefined virtual machine image on both computing platforms.
