LOCAT: Low-Overhead Online Configuration Auto-Tuning of Spark SQL Applications

Spark SQL has been widely deployed in industry, but tuning its performance is challenging. Recent studies employ machine learning (ML) to solve this problem, but they suffer from two drawbacks. First, collecting training samples takes a long time (high overhead). Second, a configuration that is optimal for one input data size of an application might not be optimal for other sizes. To address these issues, we propose LOCAT, a novel Bayesian Optimization (BO) based approach that automatically tunes the configurations of Spark SQL applications online. LOCAT introduces three techniques. The first, Query Configuration Sensitivity Analysis (QCSA), eliminates configuration-insensitive queries when collecting training samples. The second, a Datasize-Aware Gaussian Process (DAGP), models the performance of an application as a distribution of functions of both the configuration parameters and the input data size. The third, Identifying Important Configuration Parameters (IICP), identifies the parameters that matter most for performance and tunes only those. As a result, LOCAT can tune the configurations of a Spark SQL application with low overhead and adapt to different input data sizes. We evaluate LOCAT with Spark SQL applications from the benchmark suites TPC-DS, TPC-H, and HiBench, running on two significantly different clusters: a four-node ARM cluster and an eight-node x86 cluster. The experimental results on the ARM cluster show that LOCAT accelerates the optimization procedures of the state-of-the-art approaches by at least 4.1× and up to 9.7×; moreover, LOCAT improves application performance by at least 1.9× and up to 2.4×. On the x86 cluster, LOCAT shows similar results.
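To make the datasize-aware idea concrete, the following is a minimal sketch of a Gaussian process whose inputs are a configuration value together with the input data size, followed by one BO-style suggestion step for an unseen data size. It is not LOCAT's implementation: the RBF kernel, the lower-confidence-bound selection rule, the single normalized configuration parameter, the function names, and all sample values are assumptions chosen for illustration.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)


def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a GP fitted to standardized targets."""
    y_mean, y_std = y_train.mean(), y_train.std()
    z = (y_train - y_mean) / y_std
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    mean = k_qt @ np.linalg.solve(k_tt, z)
    cov = rbf_kernel(x_query, x_query) - k_qt @ np.linalg.solve(k_tt, k_qt.T)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    # Map the standardized prediction back to the original units (seconds).
    return y_mean + y_std * mean, y_std * std


def suggest_config(x_train, y_train, data_size, n_candidates=11, kappa=1.0):
    """Pick the candidate config with the lowest lower confidence bound at data_size."""
    configs = np.linspace(0.0, 1.0, n_candidates)
    candidates = np.column_stack([configs, np.full(n_candidates, data_size)])
    mean, std = gp_posterior(x_train, y_train, candidates)
    return float(configs[np.argmin(mean - kappa * std)])


# Made-up samples: each row is (normalized config value, normalized input data size).
x_train = np.array([[0.1, 0.2], [0.5, 0.2], [0.9, 0.2],
                    [0.1, 0.8], [0.5, 0.8], [0.9, 0.8]])
# Made-up execution times in seconds for those six runs.
y_train = np.array([120.0, 80.0, 100.0, 300.0, 210.0, 260.0])

# Suggest a config for a data size that was never sampled.
best = suggest_config(x_train, y_train, data_size=0.5)
print(f"suggested normalized config at data size 0.5: {best:.2f}")
```

Because data size is an ordinary input dimension of the model, samples collected at one size inform predictions at another, which is the property the abstract attributes to DAGP. A real tuner would handle many parameters and would likely use a different acquisition function.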
