Multi-Attribute Selectivity Estimation Using Deep Learning

Selectivity estimation, the problem of estimating the result size of queries, is a fundamental problem in databases. Accurately estimating the selectivity of queries that involve multiple correlated attributes is especially challenging. Poor cardinality estimates can lead the query optimizer to select bad plans. We investigate the feasibility of using deep-learning-based approaches for both point and range queries and propose two complementary approaches. Our first approach treats selectivity estimation as an unsupervised deep density estimation problem, and we introduce techniques from neural density estimation for this purpose. The key idea is to decompose the joint distribution into a set of tractable conditional probability distributions that satisfy the autoregressive property. Our second approach formulates selectivity estimation as a supervised deep learning problem that predicts the selectivity of a given query. We also introduce and address a number of practical challenges that arise when adapting deep learning to relational data. These include query/data featurization, incorporating query workload information into a deep learning framework, and the dynamic scenario in which both the data and the workload queries can be updated. Our extensive experiments, with a special emphasis on queries with a large number of predicates and/or small result sizes, demonstrate that our proposed techniques provide fast and accurate selectivity estimates with minimal space overhead.
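
The autoregressive decomposition mentioned above rests on the chain rule, p(x1, ..., xn) = p(x1) * p(x2 | x1) * ... * p(xn | x1, ..., xn-1). The following is a minimal sketch of that factorization for point queries, not the method proposed here: each conditional is read off exact prefix counts, whereas a neural density estimator would learn it. The function names and the toy table are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_prefix_counts(rows):
    """Count every attribute prefix of every row (hypothetical helper).

    prefix_counts[i] maps a tuple of the first i attribute values to the
    number of rows that start with those values.
    """
    n_attrs = len(rows[0])
    prefix_counts = [Counter() for _ in range(n_attrs + 1)]
    for row in rows:
        for i in range(n_attrs + 1):
            prefix_counts[i][tuple(row[:i])] += 1
    return prefix_counts


def point_selectivity(prefix_counts, query):
    """Multiply the conditionals p(x_i | x_1..x_{i-1}) along the chain rule."""
    selectivity = 1.0
    for i in range(len(query)):
        seen_prefix = prefix_counts[i][tuple(query[:i])]
        if seen_prefix == 0:
            return 0.0  # the prefix never occurs, so the query matches nothing
        matching = prefix_counts[i + 1][tuple(query[:i + 1])]
        selectivity *= matching / seen_prefix
    return selectivity


# Toy table (made-up data) with two perfectly correlated attributes.
rows = [("NY", "sedan"), ("NY", "sedan"), ("LA", "suv"), ("LA", "suv")]
prefix_counts = build_prefix_counts(rows)

print(point_selectivity(prefix_counts, ("NY", "sedan")))  # 0.5
print(point_selectivity(prefix_counts, ("NY", "suv")))    # 0.0
```

On this toy table an attribute-independence assumption would estimate 0.5 * 0.5 = 0.25 for both queries, while the chain-rule product recovers the true values. Exact counts do not scale to many attributes, which is where a learned model of each conditional would take over.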
