Scalable Execution of Big Data Workflows using Software Containers

Big Data processing involves handling large and complex data sets and combining different tools, frameworks, and processes that help organisations make sense of data collected from various sources. These sets of operations, referred to as Big Data workflows, need to exploit the elasticity of cloud infrastructures in order to scale. In this paper, we present the design and prototype implementation of a Big Data workflow approach that uses software container technologies and message-oriented middleware (MOM) to enable highly scalable workflow execution. We demonstrate the approach in a use case and report a set of experiments that show its practical applicability for the scalable execution of Big Data workflows. Furthermore, we compare the scalability of our approach with that of Argo Workflows, one of the most prominent tools in the area of Big Data workflows.
