StreamFlow: cross-breeding cloud with HPC

Workflows are among the most commonly used tools in a variety of execution environments. Many workflow systems target a specific environment; few make it possible to execute an entire workflow across different environments, e.g. Kubernetes and batch clusters. We present a novel approach to workflow execution, called StreamFlow, that complements the workflow graph with a declarative description of potentially complex execution environments, and that enables execution on multiple sites that do not share a common data space. StreamFlow is then exemplified on a novel bioinformatics pipeline for single-cell transcriptomic data analysis.

[1]  Rajkumar Buyya,et al.  A Taxonomy of Workflow Management Systems for Grid Computing , 2005, Proceedings of the 38th Annual Hawaii International Conference on System Sciences.

[2]  Paolo Di Tommaso,et al.  Nextflow enables reproducible computational workflows , 2017, Nature Biotechnology.

[3]  Marco Danelutto,et al.  Stkm on Sca: A Unified Framework with Components, Workflows and Algorithmic Skeletons , 2009, Euro-Par.

[4]  Miron Livny,et al.  The Evolution of the Pegasus Workflow Management Software , 2019, Computing in Science & Engineering.

[5]  Paul Hoffman,et al.  Integrating single-cell transcriptomic data across different conditions, technologies, and species , 2018, Nature Biotechnology.

[6]  Michael J. Flynn,et al.  Some Computer Organizations and Their Effectiveness , 1972, IEEE Transactions on Computers.

[7]  Douglas Thain,et al.  Makeflow: a portable abstraction for data intensive computing on clusters, clouds, and grids , 2012, SWEET '12.

[8]  Johan Montagnat,et al.  Scientific workflows: Past, present and future , 2017, Future Gener. Comput. Syst..

[9]  Mark Greenwood,et al.  Taverna: lessons in creating a workflow environment for the life sciences: Research Articles , 2006 .

[10]  Robert L. Henderson,et al.  Job Scheduling Under the Portable Batch System , 1995, JSSPP.

[11]  Andy B. Yoo,et al.  Approved for Public Release; Further Dissemination Unlimited X-ray Pulse Compression Using Strained Crystals X-ray Pulse Compression Using Strained Crystals , 2002 .

[12]  Stefano Lusso,et al.  OCCAM: a flexible, multi-purpose and extendable HPC cluster , 2017, ArXiv.

[13]  Sven Rahmann,et al.  Genome analysis , 2022 .

[14]  Edward A. Lee,et al.  CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2000; 00:1–7 Prepared using cpeauth.cls [Version: 2002/09/19 v2.02] Taverna: Lessons in creating , 2022 .

[15]  Malcolm P. Atkinson,et al.  Asterism: Pegasus and Dispel4py Hybrid Workflows for Data-Intensive Science , 2016, 2016 Seventh International Workshop on Data-Intensive Computing in the Clouds (DataCloud).

[16]  Emanuele Danovaro,et al.  HPC, Cloud and Big-Data Convergent Architectures: The LEXIS Approach , 2019, CISIS.

[17]  Jun Qin,et al.  ASKALON: A Development and Grid Computing Environment for Scientific Workflows , 2007, Workflows for e-Science, Scientific Workflows for Grids.

[18]  Ajay Mohindra,et al.  Simplifying solution deployment on a Cloud through composite appliances , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[19]  Péter Kacsuk,et al.  Deploying Docker Swarm cluster on hybrid clouds using Occopus , 2018, Adv. Eng. Softw..

[20]  Didier Donsez,et al.  Roboconf: A Hybrid Cloud Orchestrator to Deploy Complex Applications , 2015, 2015 IEEE 8th International Conference on Cloud Computing.

[21]  Vanessa Sochat,et al.  Singularity: Scientific containers for mobility of compute , 2017, PloS one.

[22]  R. Stephenson A and V , 1962, The British journal of ophthalmology.

[23]  Robert B. Ross,et al.  Supporting task-level fault-tolerance in HPC workflows by launching MPI jobs inside MPI jobs , 2017, WORKS@SC.

[24]  Tsuyoshi Murata,et al.  {m , 1934, ACML.

[25]  Guido Boella,et al.  HPC4AI: an AI-on-demand federated platform endeavour , 2018, CF.

[26]  Marta Mattoso,et al.  A Survey of Data-Intensive Scientific Workflow Management , 2015, Journal of Grid Computing.

[27]  Dirk Merkel,et al.  Docker: lightweight Linux containers for consistent development and deployment , 2014 .

[28]  David Bermbach,et al.  Requirements for an IaaS deployment language in federated Clouds , 2011, 2011 IEEE International Conference on Service-Oriented Computing and Applications (SOCA).

[29]  Ola Spjuth,et al.  Container-based bioinformatics with Pachyderm , 2018, bioRxiv.

[30]  Ivan Merelli,et al.  Precise Gene Editing Preserves Hematopoietic Stem Cell Function following Transient p53-Mediated DNA Damage Response , 2019, Cell stem cell.

[31]  Edward A. Lee,et al.  Scientific workflow management and the Kepler system , 2006, Concurr. Comput. Pract. Exp..

[32]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[33]  Miron Livny,et al.  Pegasus, a workflow management system for science automation , 2015, Future Gener. Comput. Syst..

[34]  John Chilton,et al.  The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update , 2016, Nucleic Acids Res..

[35]  John Chilton,et al.  Common Workflow Language, v1.0 , 2016 .

[36]  Jon Ander Gómez,et al.  Deep-Learning and HPC to Boost Biomedical Applications for Health (DeepHealth) , 2019, 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS).

[37]  Bertram Ludäscher,et al.  Scientific workflow management and the Kepler system: Research Articles , 2006 .

[38]  Douglas Thain,et al.  Distributed computing in practice: the Condor experience , 2005, Concurr. Pract. Exp..

[39]  Philip J. Maechling,et al.  Enabling large-scale scientific workflows on petascale resources using MPI master/worker , 2012, XSEDE '12.

[40]  Christoph Hafemeister,et al.  Comprehensive integration of single cell data , 2018, bioRxiv.

[41]  Eduard Ayguadé,et al.  Workflows for Science: a Challenge when Facing the Convergence of HPC and Big Data , 2017, Supercomput. Front. Innov..

[42]  Ola Spjuth,et al.  Galaxy-Kubernetes integration: scaling bioinformatics workflows in the cloud , 2018, bioRxiv.

[43]  Simon Moser,et al.  Topology and Orchestration Specification for Cloud Applications Version 1.0 , 2013 .

[44]  Ian T. Foster,et al.  Language Features for Scalable Distributed-Memory Dataflow Computing , 2014, 2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing.

[45]  Malcolm P. Atkinson,et al.  dispel4py: A Python framework for data-intensive scientific computing , 2014, 2014 International Workshop on Data Intensive Scalable Computing Systems.

[46]  Jan Martinovic,et al.  HyperLoom: A Platform for Defining and Executing Scientific Pipelines in Distributed Environments , 2018, PARMA-DITAM '18.

[47]  Wil M. P. van der Aalst,et al.  Workflow Patterns , 2003, Distributed and Parallel Databases.

[48]  Michael Kotliar,et al.  CWL-Airflow: a lightweight pipeline manager supporting Common Workflow Language , 2018 .

[49]  Domenico Talia,et al.  Enabling Cloud Interoperability with COMPSs , 2012, Euro-Par.

[50]  Marco Beccuti,et al.  Reproducible bioinformatics project: a community for reproducible bioinformatics analysis pipelines , 2018, BMC Bioinformatics.

[51]  Atul J. Butte,et al.  Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage , 2018, Nature Immunology.

[52]  Rizos Sakellariou,et al.  A characterization of workflow management systems for extreme-scale applications , 2016, Future Gener. Comput. Syst..

[53]  Grace X. Y. Zheng,et al.  Massively parallel digital transcriptional profiling of single cells , 2016, Nature Communications.

[54]  Alban Gaignard,et al.  Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities , 2017, Future Gener. Comput. Syst..

[55]  Ignacio Blanquer,et al.  A Platform to Deploy Customized Scientific Virtual Infrastructures on the Cloud , 2014, 2014 6th International Workshop on Science Gateways.

[56]  Duc-Hung Le,et al.  SALSA: A Framework for Dynamic Configuration of Cloud Services , 2014, 2014 IEEE 6th International Conference on Cloud Computing Technology and Science.