Black or White? How to Develop an AutoTuner for Memory-based Analytics

There is considerable interest today in building autonomous (or self-driving) data processing systems. An emerging school of thought is to leverage AI-driven "black-box" algorithms for this purpose. In this paper, we present a contrarian view. We study the problem of autotuning the memory allocation for applications running on modern distributed data processing systems. We show that RelM, an empirically driven "white-box" algorithm that we have developed, provides close-to-optimal tuning at a fraction of the overhead of state-of-the-art AI-driven "black-box" algorithms, namely Bayesian Optimization (BO) and Deep Deterministic Policy Gradient (DDPG). The main reason for RelM's superior performance is that memory management in modern memory-based data analytics systems is an interplay of algorithms at multiple levels: (i) at the resource-management level across various containers allocated by resource managers like Kubernetes and YARN, (ii) at the container level among the OS, pods, and processes such as the Java Virtual Machine (JVM), (iii) at the application level for caching, aggregation, data shuffles, and application data structures, and (iv) at the JVM level across various pools such as the Young and Old Generation. RelM understands these interactions and uses them to build an analytical solution for autotuning the memory management knobs. In another contribution, called Guided-BO (GBO), we use RelM's analytical models to speed up BO. Through an evaluation based on Apache Spark, we show that RelM's recommendations are significantly better than what commonly used Spark deployments provide and are close to those obtained by brute-force exploration, while GBO provides optimality guarantees at a cost overhead that is higher than RelM's but still significantly lower than that of the state-of-the-art AI-driven policies.
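
To make the four levels above concrete, the following is a minimal sketch of how a single Spark executor's memory might be carved up from the container down to the JVM pools. It is not RelM's analytical model, which the abstract does not spell out. The function and field names are hypothetical; the default values are assumptions taken from stock Spark and HotSpot settings (a 10% container overhead with a 384 MiB floor, 300 MiB of reserved heap, `spark.memory.fraction=0.6`, `spark.memory.storageFraction=0.5`, and `-XX:NewRatio=2`).

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class ExecutorMemoryLayout:
    """Byte sizes of the nested memory regions of one executor (illustrative)."""
    container: int   # what the resource manager (e.g. YARN) must grant
    overhead: int    # off-heap headroom inside the container
    heap: int        # JVM heap
    young_gen: int   # JVM Young Generation
    old_gen: int     # JVM Old Generation
    unified: int     # application pool shared by caching and execution
    storage: int     # part of the unified pool protected for cached data
    execution: int   # remainder, used for shuffles and aggregation

    @property
    def unified_fits_old_gen(self) -> bool:
        # Long-lived cached data ends up in the Old Generation, so a unified
        # pool larger than the Old Generation tends to cause GC pressure.
        return self.unified <= self.old_gen


def executor_memory_layout(
    heap_mib: int,
    new_ratio: int = 2,              # assumed default of -XX:NewRatio
    memory_fraction: float = 0.6,    # assumed default of spark.memory.fraction
    storage_fraction: float = 0.5,   # assumed default of spark.memory.storageFraction
    overhead_factor: float = 0.10,   # assumed default container overhead factor
    min_overhead_mib: int = 384,     # assumed minimum container overhead
    reserved_mib: int = 300,         # assumed heap reserved by Spark itself
) -> ExecutorMemoryLayout:
    heap = heap_mib * MIB
    # Levels (i) and (ii): the container must hold the heap plus overhead.
    overhead = max(min_overhead_mib * MIB, int(heap * overhead_factor))
    container = heap + overhead
    # Level (iv): NewRatio = Old / Young, so Young gets 1 / (NewRatio + 1).
    young_gen = heap // (new_ratio + 1)
    old_gen = heap - young_gen
    # Level (iii): the application-level pool for caching and execution.
    unified = int((heap - reserved_mib * MIB) * memory_fraction)
    storage = int(unified * storage_fraction)
    execution = unified - storage
    return ExecutorMemoryLayout(
        container, overhead, heap, young_gen, old_gen,
        unified, storage, execution,
    )


if __name__ == "__main__":
    for ratio in (2, 1):
        layout = executor_memory_layout(4096, new_ratio=ratio)
        print(
            f"NewRatio={ratio}: old_gen={layout.old_gen // MIB} MiB, "
            f"unified={layout.unified // MIB} MiB, "
            f"fits={layout.unified_fits_old_gen}"
        )
```

Under these assumed defaults, changing only the JVM-level knob (`new_ratio`) can flip whether the application-level pool fits in the Old Generation, which is the kind of cross-level interaction the abstract refers to.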
