Are We Ready For Learned Cardinality Estimation?

Cardinality estimation is a fundamental but long unresolved problem in query optimization. Recently, multiple papers from different research groups have consistently reported that learned models have the potential to replace existing cardinality estimators. In this paper, we ask a forward-thinking question: Are we ready to deploy these learned cardinality models in production? Our study consists of three main parts. First, we focus on the static environment (i.e., no data updates) and compare five new learned methods with eight traditional methods on four real-world datasets under a unified workload setting. The results show that learned models are indeed more accurate than traditional methods, but they often suffer from high training and inference costs. Second, we explore whether these learned models are ready for dynamic environments (i.e., frequent data updates). We find that they cannot keep up with fast data updates and return large errors for different reasons. For less frequent updates, they perform better, but there is no clear winner among them. Third, we take a deeper look into learned models and explore when they may go wrong. Our results show that the performance of learned methods can be greatly affected by changes in correlation, skewness, or domain size. More importantly, their behavior is much harder to interpret and often unpredictable. Based on these findings, we identify two promising research directions (controlling the cost of learned models and making learned models trustworthy) and suggest a number of research opportunities. We hope that our study can guide researchers and practitioners to work together to eventually push learned cardinality estimators into real database systems.
