Spark-based Cloud Data Analytics using Multi-Objective Optimization

Data analytics in the cloud has become an integral part of enterprise businesses. Big data analytics systems, however, still lack the ability to take task objectives such as user performance goals and budgetary constraints and automatically configure an analytic job to achieve these objectives. This paper presents UDAO, a Spark-based Unified Data Analytics Optimizer that can automatically determine a cluster configuration with a suitable number of cores as well as other system parameters that best meet the task objectives. At the core of our work is a principled multi-objective optimization (MOO) approach that computes a Pareto optimal set of configurations to reveal tradeoffs between different objectives, recommends a new Spark configuration that best explores such tradeoffs, and employs novel optimizations to enable such recommendations within a few seconds. Detailed experiments using benchmark workloads show that our MOO techniques provide a 2-50x speedup over existing MOO methods, while offering good coverage of the Pareto frontier. Compared to OtterTune, a state-of-the-art performance tuning system, UDAO recommends Spark configurations that yield a 26%-49% reduction in running time on the TPCx-BB benchmark while adapting to different user preferences on multiple objectives.
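To make the notion of a Pareto optimal set of configurations concrete, the following is a minimal sketch, not UDAO's actual algorithm. It assumes each candidate configuration already has a predicted latency and cost (the numbers below are made up for illustration), filters out dominated candidates by brute force, and then picks one recommendation with a simple weighted sum over normalized objectives. The names `pareto_front` and `recommend`, the candidate values, and the weighted-sum rule are all hypothetical stand-ins.

```python
from typing import List, Tuple

# Hypothetical candidate: (number of cores, predicted latency in s, predicted cost).
Config = Tuple[int, float, float]


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configurations not dominated on (latency, cost); both are minimized."""
    front = []
    for c in configs:
        dominated = any(
            o[1] <= c[1] and o[2] <= c[2] and (o[1] < c[1] or o[2] < c[2])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front


def recommend(front: List[Config], w_latency: float, w_cost: float) -> Config:
    """Pick the frontier point with the lowest weighted sum of min-max
    normalized objectives (an illustrative preference rule only)."""
    lat_min = min(c[1] for c in front)
    cost_min = min(c[2] for c in front)
    # Fall back to 1.0 when all points share the same value, to avoid dividing by zero.
    lat_range = (max(c[1] for c in front) - lat_min) or 1.0
    cost_range = (max(c[2] for c in front) - cost_min) or 1.0

    def score(c: Config) -> float:
        norm_latency = (c[1] - lat_min) / lat_range
        norm_cost = (c[2] - cost_min) / cost_range
        return w_latency * norm_latency + w_cost * norm_cost

    return min(front, key=score)


# Made-up predictions for illustration; a real system would obtain these from a model.
candidates: List[Config] = [
    (8, 120.0, 1.0),
    (16, 70.0, 1.4),
    (32, 45.0, 2.2),
    (32, 50.0, 2.5),  # dominated by (32, 45.0, 2.2)
    (64, 40.0, 4.0),
    (16, 90.0, 1.6),  # dominated by (16, 70.0, 1.4)
]

front = pareto_front(candidates)
balanced = recommend(front, w_latency=0.5, w_cost=0.5)
cost_sensitive = recommend(front, w_latency=0.1, w_cost=0.9)
```

With these example numbers, shifting weight toward cost moves the recommendation from the 32-core candidate to the 8-core one, which is the kind of tradeoff the frontier is meant to expose. The brute-force filter is quadratic in the number of candidates and serves only to illustrate the definition of dominance.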
