Exploration of Workflow Management Systems Emerging Features from Users Perspectives

New workflow applications focused on data analytics and machine learning have recently emerged. This emergence has changed the workflow management landscape, prompting the development of new data-oriented workflow management systems (WMSs) alongside the earlier standard of task-oriented WMSs. In this paper, we summarize three general workflow use-cases and explore the unique requirements of each in order to understand, from the user's perspective, how WMSs from both workflow management models meet those requirements. We analyze the applicability of the two models by describing each model and by examining the different variations of WMSs that fall under the task-driven model. To illustrate the strengths and weaknesses of each workflow management model, we summarize the key features of four production-ready WMSs: Pegasus, Makeflow, Apache Airflow, and Pachyderm. To deepen our analysis of these four WMSs, we implement three real-world use-cases that highlight the specifications and features of each WMS. We present our final assessment of each WMS after considering the following factors: usability, performance, ease of deployment, and relevance. The purpose of this work is to offer insights, from the user's perspective, into the research challenges that WMSs currently face due to the evolving workflow landscape.
