Exploiting Machine Learning for Improving In-Memory Execution of Data-Intensive Workflows on Parallel Machines

Workflows are widely used to orchestrate the complex sets of operations required to handle and process huge amounts of data. Parallel processing is often vital for reducing execution time when complex data-intensive workflows must be run efficiently, and in-memory processing can further accelerate execution. However, optimization techniques are needed to fully exploit in-memory processing and to avoid the performance drops caused by memory saturation events. This paper proposes a novel solution, called the Intelligent In-memory Workflow Manager (IIWM), for optimizing the in-memory execution of data-intensive workflows on parallel machines. IIWM is based on two complementary strategies: (1) a machine learning strategy for predicting the memory occupancy and execution time of workflow tasks; (2) a scheduling strategy that allocates tasks to a computing node, taking into account the predicted memory occupancy and execution time of each task and the memory available on that node. The effectiveness of the machine learning-based predictor and of the scheduling strategy was demonstrated experimentally using Spark as a testbed, a high-performance Big Data processing framework that exploits in-memory computing to speed up the execution of large-scale applications. In particular, two synthetic workflows were prepared to test the robustness of the IIWM in scenarios characterized by a high level of parallelism and a limited amount of memory reserved for execution. Furthermore, a real data analysis workflow was used as a case study to better assess the benefits of the proposed approach. Thanks to its high accuracy in predicting the resources used at runtime, the IIWM was able to avoid the disk writes caused by memory saturation, outperforming a traditional strategy in which only dependencies among tasks are taken into account. Specifically, the IIWM reduced makespan by up to 31% on the synthetic workflows and by up to 40% on the real case study, corresponding to performance improvements of up to 1.45× and 1.66×, respectively.
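The scheduling idea described above can be illustrated with a minimal sketch. It is not the IIWM algorithm itself: it assumes that the tasks of one workflow stage are mutually independent, that a predictor has already produced a memory and time estimate for each task, and that a plain first-fit decreasing bin-packing heuristic is an acceptable stand-in for the scheduling strategy. The names Task, pack_tasks, and estimated_makespan are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """A workflow task with predicted resource usage (hypothetical structure)."""
    name: str
    predicted_memory: float  # predicted peak memory occupancy, e.g. in GB
    predicted_time: float    # predicted execution time, e.g. in seconds


def pack_tasks(tasks: List[Task], memory_capacity: float) -> List[List[Task]]:
    """Group independent tasks into batches with first-fit decreasing.

    Tasks in the same batch are meant to run concurrently on one node, so the
    sum of their predicted memory must not exceed the memory available there.
    """
    batches: List[List[Task]] = []
    # Consider the most memory-hungry tasks first (the "decreasing" part).
    for task in sorted(tasks, key=lambda t: t.predicted_memory, reverse=True):
        if task.predicted_memory > memory_capacity:
            raise ValueError(f"task {task.name} does not fit in memory on its own")
        for batch in batches:
            used = sum(t.predicted_memory for t in batch)
            if used + task.predicted_memory <= memory_capacity:
                batch.append(task)  # first batch with enough room (the "first-fit" part)
                break
        else:
            batches.append([task])  # no existing batch had room, so open a new one
    return batches


def estimated_makespan(batches: List[List[Task]]) -> float:
    """Estimate total time, assuming batches run one after another and the
    tasks inside a batch run fully in parallel."""
    return sum(max(t.predicted_time for t in batch) for batch in batches)


if __name__ == "__main__":
    stage = [
        Task("a", 6.0, 30.0),
        Task("b", 5.0, 20.0),
        Task("c", 4.0, 10.0),
        Task("d", 3.0, 25.0),
        Task("e", 2.0, 5.0),
    ]
    for batch in pack_tasks(stage, memory_capacity=10.0):
        print([t.name for t in batch])
```

In this sketch the predicted execution time is used only to estimate the makespan of a packing. A scheduler in the spirit of the abstract could also use it when choosing which tasks to place together, but that choice is left open here.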
