Gray-Box Models for Performance Assessment of Spark Applications

Big data applications are well suited to execution on cluster resources because of their high demands for computational power and data storage. Correctly sizing the resources devoted to them, however, does not guarantee that they will run as expected: their execution can be affected by perturbations that change the execution time. Detecting when such issues have occurred, by comparing the actual execution time with the expected one, is necessary to identify potentially critical situations and to take appropriate steps to prevent them. This requires accurate estimates. In this paper, machine learning techniques coupled with a posteriori knowledge are exploited to build performance estimation models. Experimental results show that the models built with the proposed approach outperform a reference state-of-the-art method (the Ernest method), reducing the error in some scenarios from 221.09-167.07% to 13.15-30.58%.
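
As an illustration of the comparison between actual and expected execution time described above, the following is a minimal sketch and not the paper's implementation. The function names, the 30% tolerance, the sample run data, and the choice to normalize the error by the actual time are all assumptions made for the example.

```python
# Minimal sketch (assumptions: names, tolerance, and sample data are hypothetical).

def percentage_error(actual_s, expected_s):
    """Absolute error of the estimate as a percentage of the actual time.

    Normalizing by the actual time is an assumption; the abstract does not
    state which error definition is used.
    """
    if actual_s <= 0:
        raise ValueError("actual execution time must be positive")
    return abs(actual_s - expected_s) / actual_s * 100.0


def is_anomalous(actual_s, expected_s, tolerance_pct=30.0):
    """Flag a run whose actual time deviates from the estimate by more than
    tolerance_pct percent (the default threshold is an arbitrary assumption)."""
    return percentage_error(actual_s, expected_s) > tolerance_pct


# Hypothetical runs: (name, actual seconds, expected seconds).
runs = [
    ("run-a", 120.0, 110.0),  # close to the estimate
    ("run-b", 300.0, 115.0),  # far slower than estimated
]

flagged = [name for name, actual, expected in runs
           if is_anomalous(actual, expected)]
print(flagged)
```

A check like this is only as useful as the estimate behind it: if the model's own error is larger than the tolerance, healthy runs are flagged and perturbed runs can go unnoticed, which is why the abstract stresses accurate estimates.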
