PGMJoins: Random Join Sampling with Graphical Models

Modern databases face formidable challenges when called upon to join several massive tables. Joins, especially many-to-many joins, are very time- and resource-consuming; join results can be too big to keep in memory; and performing analytics or learning tasks over them is costly in time, resources, and (in the cloud) money. Random sampling is a promising way to mitigate these problems, but the current state of the art leaves considerable room for improvement. In this paper we contribute a principled solution, coined PGMJoins. PGMJoins adapts Probabilistic Graphical Models (PGMs) to derive provably random samples of the join result for n-way key joins, many-to-many joins, and both cyclic and acyclic joins. PGMJoins contributes optimizations both for deriving the structure of the graph and for PGM inference. It also contributes a novel Sum-Product Message Passing Algorithm (SP-MPA) that efficiently draws a uniform sample from the joint distribution (the join result), and a novel way to deal with cyclic joins. Despite the use of PGMs, the learned joint distribution is not approximated, and the uniform samples are drawn from the true distribution. Our experiments using queries and datasets from TPC-H, JOB, TPC-DS, and Twitter show that PGMJoins outperforms the state of the art by 2x to 28x.
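
To make the idea of exact, uniform sampling via message passing concrete, the following is a minimal sketch for a three-table chain join R(a, b) ⋈ S(b, c) ⋈ T(c, d). It is not the paper's SP-MPA: the table layout and the function names `join_weights` and `sample_join` are assumptions made purely for illustration. Counts are passed backward along the chain (each tuple's weight is the number of join results it participates in), and a result is then drawn forward in proportion to those weights, so every join result is equally likely without materializing the join.

```python
import random
from collections import defaultdict

def join_weights(R, S, T):
    """Backward pass: count how many join results each tuple extends to.

    Assumed layout (illustrative only): R rows are (a, b),
    S rows are (b, c), T rows are (c, d).
    """
    # Group T by its join key c; the message T -> S is the group size.
    t_by_c = defaultdict(list)
    for t in T:
        t_by_c[t[0]].append(t)

    # Weight of an S row = number of matching T rows.
    # The message S -> R for key b is the sum of those weights.
    s_by_b = defaultdict(list)
    msg_b = defaultdict(int)
    for s in S:
        w = len(t_by_c.get(s[1], ()))
        if w:
            s_by_b[s[0]].append((s, w))
            msg_b[s[0]] += w

    # Weight of an R row = number of join results that start with it.
    r_weighted = [(r, msg_b[r[1]]) for r in R if msg_b.get(r[1], 0) > 0]
    join_size = sum(w for _, w in r_weighted)
    return r_weighted, s_by_b, t_by_c, join_size

def sample_join(weights, rng=random):
    """Forward pass: draw one join result uniformly at random."""
    r_weighted, s_by_b, t_by_c, join_size = weights
    if join_size == 0:
        raise ValueError("the join result is empty")
    # Pick r with probability weight(r) / join_size.
    r = rng.choices([r for r, _ in r_weighted],
                    weights=[w for _, w in r_weighted])[0]
    # Pick s among the matches of r, proportionally to its weight.
    cands = s_by_b[r[1]]
    s = rng.choices([s for s, _ in cands],
                    weights=[w for _, w in cands])[0]
    # All T rows matching s extend to exactly one result: pick uniformly.
    t = rng.choice(t_by_c[s[1]])
    return (r[0], r[1], s[1], t[1])  # (a, b, c, d)

if __name__ == "__main__":
    R = [(1, "x"), (2, "y"), (3, "z")]
    S = [("x", 10), ("x", 20), ("y", 20)]
    T = [(10, "p"), (20, "q"), (20, "r")]
    w = join_weights(R, S, T)
    print("join size:", w[3])
    print("one sample:", sample_join(w, random.Random(0)))
```

The probability of any particular result (r, s, t) is weight(r)/|J| · weight(s)/weight(r) · 1/weight(s) = 1/|J|, where |J| is the join size, which is why the draw is uniform. This sketch covers only an acyclic chain; cyclic joins need additional machinery that is not shown here.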
