Leveraging resource management for efficient performance of Apache Spark

Apache Spark is one of the most widely used open source processing framework for big data, it allows to process large datasets in parallel using a large number of nodes. Often, applications of this framework use resource management systems like YARN, which provide jobs a specific amount of resources for their execution. In addition, a distributed file system such as HDFS stores the data that is to be analyzed by the framework. This design allows sharing cluster resources effectively by running jobs on a single-node cluster or multi-nodes cluster infrastructure. Thus, one challenging issue is to realize effective resource management of these large cluster infrastructures in order to run distributed data analytics in an economically viable way. In this study, we use the Machine Learning library (MLlib) of Spark to implement different machine learning algorithms, then we manage the resources (CPU, memory, and Disk) in order to assess the performance of Apache Spark. In this paper, we present a review of various works that focus on resource management and data processing in Big Data platforms. Furthermore, we perform a scalability analysis using Spark. We analyze the speedup and processing time. We deduce that from a certain number of nodes in the cluster, it is no longer necessary to add additional nodes to improve the speedup and the processing Time. Then, we investigate the tuning of the resource allocation in Spark. We showed that it is not only by allocating all the available resources we get better performance but it depends on how to tune the resource allocation. We propose new managed parameters and we show that they give better total processing time than the default parameters used by Spark. Finally, we study the Persistence of Resilient Distributed Datasets (RDDs) in Spark using machine learning algorithms. We show that one storage level gives the best execution time among all tested storage levels.

[1]  Ameet Talwalkar,et al.  MLlib: Machine Learning in Apache Spark , 2015, J. Mach. Learn. Res..

[2]  Liang Dong,et al.  Starfish: A Self-tuning System for Big Data Analytics , 2011, CIDR.

[3]  Jia Guo,et al.  Smart-MLlib: A High-Performance Machine-Learning Library , 2016, 2016 IEEE International Conference on Cluster Computing (CLUSTER).

[4]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[5]  Patrick Wendell,et al.  Learning Spark: Lightning-Fast Big Data Analytics , 2015 .

[6]  Holden Karau,et al.  High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark , 2017 .

[7]  Ben He,et al.  A Novel Method for Tuning Configuration Parameters of Spark Based on Machine Learning , 2016, 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS).

[8]  Ion Stoica,et al.  Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics , 2016, NSDI.

[9]  Marcos Dias de Assunção,et al.  Apache Spark , 2019, Encyclopedia of Big Data Technologies.

[10]  Odej Kao,et al.  Adaptive Resource Management for Distributed Data Analytics based on Container-level Cluster Monitoring , 2017, DATA.

[11]  Lei Gu,et al.  Memory or Time: Performance Evaluation for Iterative Operation on Hadoop and Spark , 2013, 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing.

[12]  Dilpreet Singh,et al.  A survey on platforms for big data analytics , 2014, Journal of Big Data.

[13]  Reynold Xin,et al.  Scaling Spark in the Real World: Performance and Usability , 2015, Proc. VLDB Endow..

[14]  Stan Matwin,et al.  Meta-MapReduce for scalable data mining , 2015, Journal of Big Data.

[15]  Guangchi Liu,et al.  Big data machine learning using apache spark MLlib , 2017, 2017 IEEE International Conference on Big Data (Big Data).

[17]  P. Baldi,et al.  Searching for exotic particles in high-energy physics with deep learning , 2014, Nature Communications.

[18]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[19]  Chen Wang,et al.  Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics , 2015, Proc. VLDB Endow..

[20]  Yu Cao,et al.  HeteroSpark: A heterogeneous CPU/GPU Spark platform for machine learning algorithms , 2015, 2015 IEEE International Conference on Networking, Architecture and Storage (NAS).

[21]  K. Bakshi,et al.  Considerations for big data: Architecture and approach , 2012, 2012 IEEE Aerospace Conference.

[22]  Francisco Herrera,et al.  kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors classifier for big data , 2017, Knowl. Based Syst..

[23]  Scott Shenker,et al.  Fast and Interactive Analytics over Hadoop Data with Spark , 2012, login Usenix Mag..

[24]  Carlo Curino,et al.  Apache Hadoop YARN: yet another resource negotiator , 2013, SoCC.

[25]  Joshua Zhexue Huang,et al.  Big data analytics on Apache Spark , 2016, International Journal of Data Science and Analytics.

[26]  Christos Doulkeridis,et al.  A survey of large-scale analytical query processing in MapReduce , 2013, The VLDB Journal.

[27]  Reynold Xin,et al.  Apache Spark , 2016 .

[28]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[29]  Randy H. Katz,et al.  Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center , 2011, NSDI.

[30]  Chunlin Li,et al.  Big Data Processing with Apache Spark in Tertiary Institutions: Spark Streaming , 2017 .

[31]  Mohak Shah,et al.  ADMM based scalable machine learning on Spark , 2015, 2015 IEEE International Conference on Big Data (Big Data).

[32]  Mostafa Bellafkih,et al.  Big Data Optimisation Among RDDs Persistence in Apache Spark , 2018, BDCA.

[33]  Saeed Shahrivari,et al.  Beyond Batch Processing: Towards Real-Time and Streaming Big Data , 2014, Comput..

[34]  Mostafa Bellafkih,et al.  Big Data Processing using Machine Learning algorithms: MLlib and Mahout Use Case , 2018, SITA.

[35]  Cheng-Hao Tsai,et al.  Large-scale logistic regression and linear support vector machines using spark , 2014, 2014 IEEE International Conference on Big Data (Big Data).

[36]  James G. Shanahan,et al.  Large Scale Distributed Data Science using Apache Spark , 2015, KDD.