To Tune or Not to Tune?: In Search of Optimal Configurations for Data Analytics

This experimental study presents a number of issues that pose a challenge for practical configuration tuning and its deployment in data analytics frameworks. These issues include: 1) the assumption of a static workload or environment, ignoring the dynamic characteristics of the analytics environment (e.g., increase in input data size, changes in allocation of resources). 2) the amortization of tuning costs and how this influences what workloads can be tuned in practice in a cost-effective manner. 3) the need for a comprehensive incremental tuning solution for a diverse set of workloads. We adapt different ML techniques in order to obtain efficient incremental tuning in our problem domain, and propose Tuneful, a configuration tuning framework. We show how it is designed to overcome the above issues and illustrate its applicability by running a wide array of experiments in cloud environments provided by two different service providers.

[1]  Shoaib Kamil,et al.  OpenTuner: An extensible framework for program autotuning , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[2]  Marc Cohen,et al.  Google Compute Engine , 2014 .

[3]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[4]  Yuqing Zhu,et al.  BestConfig: tapping the performance potential of systems via automatic configuration tuning , 2017, SoCC.

[5]  Andy Liaw,et al.  Classification and Regression by randomForest , 2007 .

[6]  Jie Huang,et al.  The HiBench benchmark suite: Characterization of the MapReduce-based data analysis , 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).

[7]  Minlan Yu,et al.  CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics , 2017, NSDI.

[8]  Jasper Snoek,et al.  Multi-Task Bayesian Optimization , 2013, NIPS.

[9]  Aixia Guo,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2014 .

[10]  Xuehai Qian,et al.  Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing , 2018, ASPLOS.

[11]  Andrew Rice,et al.  Tuneful: An Online Significance-Aware Configuration Tuner for Big Data Analytics , 2020, ArXiv.

[12]  Nando de Freitas,et al.  Taking the Human Out of the Loop: A Review of Bayesian Optimization , 2016, Proceedings of the IEEE.

[13]  Geoffrey J. Gordon,et al.  Automatic Database Management System Tuning Through Large-scale Machine Learning , 2017, SIGMOD Conference.

[14]  I. Sobol,et al.  On quasi-Monte Carlo integrations , 1998 .

[15]  Seyoon Ko,et al.  Processing large-scale data with Apache Spark , 2016 .

[16]  Kushal Datta,et al.  Gunther: Search-Based Auto-Tuning of MapReduce , 2013, Euro-Par.

[17]  Carl E. Rasmussen,et al.  Gaussian Processes for Data-Efficient Learning in Robotics and Control , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  John A. Weymark,et al.  GENERALIZED GIN 1 INEQUALITY INDICES , 2001 .

[19]  Emanuele Borgonovo,et al.  Sensitivity analysis: A review of recent advances , 2016, Eur. J. Oper. Res..

[20]  Shivnath Babu,et al.  Tuning Database Configuration Parameters with iTuned , 2009, Proc. VLDB Endow..

[21]  Saurabh Bagchi,et al.  Rafiki: a middleware for parameter tuning of NoSQL datastores for dynamic metagenomics workloads , 2017, Middleware.

[22]  Palden Lama,et al.  AROMA: automated resource allocation and configuration of mapreduce environment in the cloud , 2012, ICAC '12.

[23]  Ben He,et al.  A Novel Method for Tuning Configuration Parameters of Spark Based on Machine Learning , 2016, 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS).