Automatic Database Management System Tuning Through Large-scale Machine Learning

Database management system (DBMS) configuration tuning is an essential aspect of any data-intensive application effort. But this is historically a difficult task because DBMSs have hundreds of configuration "knobs" that control everything in the system, such as the amount of memory to use for caches and how often data is written to storage. The problem with these knobs is that they are not standardized (i.e., two DBMSs use a different name for the same knob), not independent (i.e., changing one knob can impact others), and not universal (i.e., what works for one application may be sub-optimal for another). Worse, information about the effects of the knobs typically comes only from (expensive) experience. To overcome these challenges, we present an automated approach that leverages past experience and collects new information to tune DBMS configurations: we use a combination of supervised and unsupervised machine learning methods to (1) select the most impactful knobs, (2) map unseen database workloads to previous workloads from which we can transfer experience, and (3) recommend knob settings. We implemented our techniques in a new tool called OtterTune and tested it on two DBMSs. Our evaluation shows that OtterTune recommends configurations that are as good as or better than ones generated by existing tools or a human expert.

[1]  Shivnath Babu,et al.  Tuning Database Configuration Parameters with iTuned , 2009, Proc. VLDB Endow..

[2]  Stanley B. Zdonik,et al.  On Predictive Modeling for Optimizing Transaction Execution in Parallel OLTP Systems , 2011, Proc. VLDB Endow..

[3]  Miron Livny,et al.  Goal-oriented buffer management revisited , 1996, SIGMOD '96.

[4]  Carlo Curino,et al.  Schism , 2010, Proc. VLDB Endow..

[5]  Robert Tibshirani,et al.  Estimating the number of clusters in a data set via the gap statistic , 2000 .

[6]  Mohamed F. Mokbel,et al.  SARD: A statistical approach for ranking database tuning parameters , 2008, 2008 IEEE 24th International Conference on Data Engineering Workshop.

[7]  Anastasia Ailamaki,et al.  Continuous resource monitoring for self-predicting DBMS , 2005, 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems.

[8]  Surajit Chaudhuri,et al.  AutoAdmin “what-if” index analysis utility , 1998, SIGMOD '98.

[9]  Eva Kwan Automatic Configuration for IBM ® DB2 Universal , 2002 .

[10]  Michael Stonebraker,et al.  The Asilomar report on database research , 1998, SGMD.

[11]  R. Tibshirani,et al.  Least angle regression , 2004, math/0406456.

[12]  Laurent Perron,et al.  Structured vs. Unstructured Large Neighborhood Search: A Case Study on Job-Shop Scheduling Problems with Earliness and Tardiness Costs , 2003, CP.

[13]  Bowei Xi,et al.  A smart hill-climbing algorithm for application server configuration , 2004, WWW '04.

[14]  Andrea Lodi,et al.  Local Search and Constraint Programming , 2003, Handbook of Metaheuristics.

[15]  D. Clayton,et al.  Genome-wide association study and meta-analysis finds over 40 loci affect risk of type 1 diabetes , 2009, Nature Genetics.

[16]  Daniel C. Zilio,et al.  DB2 advisor: an optimizer smart enough to recommend its own indexes , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[17]  Daniel C. Zilio,et al.  Physical database design decision algorithms and concurrent reorganization for parallel database systems , 1998 .

[18]  Surajit Chaudhuri,et al.  An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server , 1997, VLDB.

[19]  François Laburthe,et al.  LS and CP illustrated on a transportation problem , 2004 .

[20]  Benoît Dageville,et al.  Oracle's SQL Performance Analyzer , 2008, IEEE Data Eng. Bull..

[21]  Andreas Krause,et al.  Contextual Gaussian Process Bandit Optimization , 2011, NIPS.

[22]  Christopher Ré,et al.  Materialization optimizations for feature selection workloads , 2014, SIGMOD Conference.

[23]  Ashraf Aboulnaga,et al.  Automatic virtual machine configuration for database workloads , 2008, SIGMOD Conference.

[24]  Carlo Curino,et al.  OLTP-Bench: An Extensible Testbed for Benchmarking Relational Databases , 2013, Proc. VLDB Endow..

[25]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[26]  Patrick Martin,et al.  Techniques for automatically sizing multiple buffer pools in DB2 , 2003, CASCON.

[27]  Benoît Dageville,et al.  Self-Tuning for SQL Performance in Oracle Database 11g , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[28]  Thomas Neumann,et al.  TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark , 2013, TPCTC.

[29]  Anthony K. H. Tung,et al.  A new approach to dynamic self-tuning of database buffers , 2008, TOS.

[30]  Margo I. Seltzer,et al.  Using probabilistic reasoning to automate software tuning , 2004, SIGMETRICS '04/Performance '04.

[31]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[32]  Sam Lightstone,et al.  Automatic Database Configuration for DB2 Universal Database: Compressing Years of Performance Expertise into Seconds of Execution , 2003, BTW.

[33]  Sam Lightstone,et al.  Adaptive self-tuning memory in DB2 , 2006, VLDB.

[34]  Surajit Chaudhuri,et al.  Table of Contents (pdf) , 2007, VLDB.

[35]  Vivek R. Narasayya,et al.  Integrating vertical and horizontal partitioning into automated physical database design , 2004, SIGMOD '04.

[36]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[37]  R. Tibshirani,et al.  Uniform asymptotic inference and the bootstrap after model selection , 2015, The Annals of Statistics.

[38]  Chun Zhang,et al.  Automating physical database design in a parallel database , 2002, SIGMOD '02.

[39]  Surajit Chaudhuri,et al.  Automated Selection of Materialized Views and Indexes in SQL Databases , 2000, VLDB.

[40]  Michael Stonebraker,et al.  Intel "big data" science and technology center vision and execution plan , 2013, SGMD.

[41]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[42]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[43]  Graham Wood,et al.  Automatic Performance Diagnosis and Tuning in Oracle , 2005, CIDR.

[44]  Barzan Mozafari,et al.  DBSherlock: A Performance Diagnostic Tool for Transactional Databases , 2016, SIGMOD Conference.

[45]  WeikumGerhard,et al.  The COMFORT automatic tuning project , 1994 .

[46]  Adam Silberstein,et al.  Benchmarking cloud serving systems with YCSB , 2010, SoCC '10.

[47]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[48]  D. Pham,et al.  Selection of K in K-means clustering , 2005 .

[49]  Jeffrey D. Ullman,et al.  Index selection for OLAP , 1997, Proceedings 13th International Conference on Data Engineering.