Otterman: A Novel Approach of Spark Auto-tuning by a Hybrid Strategy

Spark has become an attractive platform for big data analytics in recent years thanks to advantages such as parallelism, fault tolerance, and reduced complexity of cluster setup. On the Spark platform, users can adjust parameter configurations to suit different job requirements and applications in order to optimize performance. This raises a problem that cannot be ignored: Spark already has more than 180 parameters, and the number of possible combinations is so large that manual tuning cannot capture the impact of every parameter on performance. To reduce the heavy reliance on expert experience and manual operation, we propose Otterman, a parameter optimization approach that combines the Simulated Annealing algorithm with the Least Squares method and dynamically adjusts parameters according to job type to obtain a configuration that improves performance. Simulated Annealing can find the optimal solution but converges slowly; we use the Least Squares method to speed up its convergence. Otterman is simple and easy to apply and incurs no additional cost. Experiments verify the effectiveness of the approach: Otterman improves average performance by 30% over the default parameter configuration, with an accuracy of about 68%.
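The abstract does not say how the two methods are combined, so the following is only a minimal sketch of one possible hybrid and not Otterman's actual algorithm. It assumes a simulated annealing loop whose proposal step is biased by a least-squares slope fitted to recently observed (configuration, runtime) pairs. The parameter names, ranges, cooling schedule, and the `synthetic_runtime` cost function are all hypothetical stand-ins; no real Spark job is run.

```python
import math
import random

# Hypothetical search space: (min, max, step) per parameter.
# These names are illustrative and are not taken from the paper.
SPACE = {
    "executor_memory_gb": (1, 16, 1),
    "shuffle_partitions": (50, 1000, 50),
}


def synthetic_runtime(cfg):
    """Stand-in for measuring a Spark job; a made-up convex cost, not real data."""
    mem = cfg["executor_memory_gb"]
    parts = cfg["shuffle_partitions"]
    return (mem - 10) ** 2 + ((parts - 400) / 50) ** 2 + 20


def ls_slope(xs, ys):
    """Ordinary least-squares slope of y on x (0.0 when x has no variance)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def propose(cfg, history, rng, window=8, bias=0.8):
    """Move one parameter by one step, usually in the direction
    that the local least-squares fit suggests will lower the cost."""
    name = rng.choice(sorted(SPACE))
    lo, hi, step = SPACE[name]
    recent = history[-window:]
    slope = ls_slope([c[name] for c, _ in recent], [t for _, t in recent])
    direction = rng.choice([-1, 1])
    if slope != 0 and rng.random() < bias:
        direction = -1 if slope > 0 else 1
    new = dict(cfg)
    new[name] = min(hi, max(lo, cfg[name] + direction * step))
    return new


def anneal(cost, start, iters=300, t0=10.0, alpha=0.98, seed=0):
    """Simulated annealing with geometric cooling and the biased proposal above."""
    rng = random.Random(seed)
    cur, cur_cost = dict(start), cost(start)
    best, best_cost = dict(cur), cur_cost
    history = [(dict(cur), cur_cost)]
    temp = t0
    for _ in range(iters):
        cand = propose(cur, history, rng)
        cand_cost = cost(cand)
        history.append((cand, cand_cost))
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if cand_cost <= cur_cost or rng.random() < math.exp((cur_cost - cand_cost) / temp):
            cur, cur_cost = cand, cand_cost
        if cur_cost < best_cost:
            best, best_cost = dict(cur), cur_cost
        temp *= alpha
    return best, best_cost


if __name__ == "__main__":
    start = {"executor_memory_gb": 2, "shuffle_partitions": 100}
    print(anneal(synthetic_runtime, start))
```

In a real setting, the cost function would presumably submit the job with the candidate configuration and return the measured runtime. The slope bias is a heuristic that can speed up descent, while the Metropolis acceptance step keeps the ability to escape local minima.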
