Sample Debiasing in the Themis Open World Database System

Open world database management systems, which assume that tuples not in the database still exist, are an increasingly important area of research. We present Themis, the first open world database that automatically rebalances arbitrarily biased samples to approximately answer queries as if they were issued over the entire population. We leverage a priori population aggregate information to develop and combine two approaches to automatic debiasing: sample reweighting and Bayesian network probabilistic modeling. We build a prototype of Themis and demonstrate that it achieves higher query accuracy than the default approximate query processing (AQP) approach, an alternative sample reweighting technique, and a variety of Bayesian network models, while maintaining interactive query response times. We also show that Themis is robust to differences in support between the sample and the population, a key use case when using social media samples.
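To make the idea of reweighting a biased sample against population aggregates concrete, the following is a minimal sketch of iterative proportional fitting (raking), a standard reweighting technique. It is an illustration under stated assumptions and not necessarily the procedure Themis uses: the function name `rake_weights`, the attribute names, and the toy counts are all hypothetical.

```python
from collections import defaultdict


def rake_weights(sample, marginals, iterations=50):
    """Illustrative raking (iterative proportional fitting); hypothetical helper.

    sample:    list of dicts, one per sampled tuple.
    marginals: {attribute: {value: population_count}}, the known aggregates.
    Returns one weight per sampled tuple.
    """
    weights = [1.0] * len(sample)
    for _ in range(iterations):
        for attr, targets in marginals.items():
            # Current weighted total for each value of this attribute.
            totals = defaultdict(float)
            for row, w in zip(sample, weights):
                totals[row[attr]] += w
            # Rescale tuples so the weighted totals match the population counts.
            for i, row in enumerate(sample):
                value = row[attr]
                if value in targets and totals[value] > 0:
                    weights[i] *= targets[value] / totals[value]
    return weights


# Toy biased sample (assumed data): young users in the west are over-represented.
sample = [
    {"age": "young", "region": "west"},
    {"age": "young", "region": "west"},
    {"age": "young", "region": "east"},
    {"age": "old", "region": "east"},
]
# Assumed population aggregates over single attributes.
marginals = {
    "age": {"young": 50, "old": 50},
    "region": {"west": 40, "east": 60},
}

weights = rake_weights(sample, marginals)

# A COUNT(*) query with a predicate becomes a sum of weights over matching tuples.
estimated_old = sum(w for row, w in zip(sample, weights) if row["age"] == "old")
estimated_west = sum(w for row, w in zip(sample, weights) if row["region"] == "west")
```

With weights in hand, aggregate queries over the sample are answered by summing weights instead of counting tuples. Note that reweighting alone cannot produce tuples whose attribute combinations never appear in the sample, which is the gap a probabilistic model of the population is meant to address.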
