Sharing and performance optimization of reproducible workflows in the cloud

Abstract: Scientific workflows play a vital role in modern science as they enable scientists to specify, share and reuse computational experiments. To maximize the benefits, workflows need to support the reproducibility of the experimental methods they capture. Reproducibility enables effective sharing, as scientists can re-execute experiments developed by others and quickly derive new or improved results. However, achieving reproducibility in practice is problematic: previous analyses highlight issues due to uncontrolled changes in the input data, configuration parameters, workflow description and the software used to implement the workflow tasks. The resulting problems have become known as workflow decay. In this paper we present a novel framework that addresses workflow decay through the integration of system description, version control, container management and automated deployment techniques. We then introduce a set of performance optimization techniques that significantly reduce the runtime overheads caused by making workflows reproducible. The resulting system significantly improves performance, repeatability and the ability to share and reuse workflows by combining a method to uniquely identify task and workflow images with an automated image capture facility and a multi-level cache. The system is evaluated through an extensive set of experiments that validate the approach and highlight the key benefits of the proposed optimizations. These include methods for reducing the runtime of workflows by up to an order of magnitude in cases where they are enacted concurrently on the same host VM and in different Clouds, and where they share tasks.
