Abstract Spark has attracted growing attention over the past few years as an in-memory cloud computing platform. It supports various types of workloads, such as SQL queries and machine learning applications, and many enterprises now use it for fast in-memory processing of large-scale data. Speeding up execution in Spark is an important problem for many real-time applications. This can be achieved by improving Spark's scheduling approaches, optimizing the execution plans Spark generates for various applications, and selecting the best cluster configuration for an input workload. A first step for all of these optimization approaches is to predict the execution time of an input Spark application. In this paper, we present a new platform that predicts with high accuracy the execution time of SQL queries and machine learning applications executed by Spark. We evaluate the platform by measuring its accuracy in predicting the execution time of various types of Spark jobs, including TPC-H queries and machine learning classification and clustering applications. The experiments show that the platform predicts the execution time of Spark jobs with accuracy greater than 90% for SQL queries and greater than 75% for machine learning jobs.