Towards a Learning Optimizer for Shared Clouds

Query optimizers are notorious for inaccurate cost estimates, which lead to poor performance. The root of the problem lies in inaccurate cardinality estimates, i.e., estimates of the size of intermediate (and final) results in a query plan. In modern shared cloud infrastructures, these estimates also determine the resources a job consumes. In this paper, we present CardLearner, a machine-learning-based approach that learns cardinality models from previous job executions and uses them to predict cardinalities in future jobs. The key intuition is that shared cloud workloads are often recurring and overlapping, so we can learn cardinality models for overlapping subgraph templates. We discuss various learning approaches and show how learning a large number of smaller models results in high accuracy and explainability. We further present an exploration technique that avoids learning bias by considering alternate join orders and learning cardinality models over them. We describe the feedback loop that applies the learned models to future job executions. Finally, we present a detailed evaluation of our models (up to 5 orders of magnitude less error), query plans (60% applicability), performance (up to 100% faster, 3x fewer resources), and exploration (optimal within a few tens of executions).
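To make the idea of many small per-template models concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical history of `(template_id, input_rows, output_rows)` observations from past jobs and fits one log-log linear model per template with ordinary least squares; the feature choice, the model form, and the names `fit_template_models` and `predict_cardinality` are all illustrative assumptions.

```python
import math
from collections import defaultdict


def fit_template_models(history):
    """Fit one small model per subgraph template (illustrative assumption).

    history: iterable of (template_id, input_rows, output_rows) tuples with
    positive row counts. Returns {template_id: (intercept, slope)} for the
    model log(output_rows) = intercept + slope * log(input_rows).
    """
    groups = defaultdict(list)
    for template_id, input_rows, output_rows in history:
        groups[template_id].append((math.log(input_rows), math.log(output_rows)))

    models = {}
    for template_id, points in groups.items():
        n = len(points)
        mean_x = sum(x for x, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in points)
        if var_x < 1e-12:
            # Input size never varied: fall back to a constant prediction.
            slope = 0.0
        else:
            cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
            slope = cov_xy / var_x
        models[template_id] = (mean_y - slope * mean_x, slope)
    return models


def predict_cardinality(models, template_id, input_rows):
    """Return the predicted output rows, or None if the template is unseen."""
    if template_id not in models:
        return None  # caller keeps the optimizer's default estimate
    intercept, slope = models[template_id]
    return math.exp(intercept + slope * math.log(input_rows))


# Hypothetical history: a filter template with roughly 10% selectivity.
history = [
    ("filter_a", 100, 10),
    ("filter_a", 1000, 100),
    ("filter_a", 10000, 1000),
]
models = fit_template_models(history)
estimate = predict_cardinality(models, "filter_a", 500)
```

Returning `None` for an unseen template mirrors the partial applicability the abstract reports: a learned model is used only where a matching template exists, and the default estimate applies otherwise.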
