Learning From Query-Answers

Tuple-independent and disjoint-independent probabilistic databases (TI- and DI-PDBs) represent uncertain data in a factorized form as a product of independent random variables that represent either tuples (TI-PDBs) or sets of tuples (DI-PDBs). When the user submits a query, the database derives the marginal probability of each output tuple, exploiting the underlying assumptions of statistical independence. While query processing in TI- and DI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this article. We first introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning without imposing any burden on regular query processing. Building on B-PDBs, we then introduce Dirichlet Probabilistic Databases (D-PDBs), a generalization of DI-PDBs with similar properties. We provide the following key contributions for both B- and D-PDBs: (i) we study the complexity of performing Bayesian belief updates and devise efficient algorithms for certain tractable classes of queries; (ii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iii) we present an algorithm for efficiently computing conditional probabilities, allowing us to efficiently implement B- and D-PDBs via a standard relational engine; and (iv) we support our conclusions with extensive experimental results.
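
The idea of treating each tuple parameter as a Beta-distributed variable can be illustrated with a minimal Python sketch. It is not the article's algorithm: it covers only the simplest case, in which the presence or absence of a single tuple is observed directly, so the standard Beta-Bernoulli conjugate update applies. The names `BetaTuple`, `observe`, and `prob_any` are hypothetical and chosen for illustration. The sketch assumes that, because tuples are independent, the probability of a simple "at least one tuple exists" query can be computed from the Beta means alone.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BetaTuple:
    """Hypothetical uncertain tuple whose existence probability is Beta(alpha, beta)."""
    alpha: float
    beta: float

    def mean(self) -> float:
        # Expected existence probability; this plays the role of the
        # ordinary tuple probability during regular query processing.
        return self.alpha / (self.alpha + self.beta)

    def observe(self, present: bool) -> "BetaTuple":
        # Conjugate Beta-Bernoulli update for a direct observation of the tuple.
        # Observations of general query answers are harder and are not handled here.
        if present:
            return BetaTuple(self.alpha + 1.0, self.beta)
        return BetaTuple(self.alpha, self.beta + 1.0)


def prob_any(tuples: Iterable[BetaTuple]) -> float:
    """Probability that at least one of the independent tuples exists."""
    p_none = 1.0
    for t in tuples:
        p_none *= 1.0 - t.mean()
    return 1.0 - p_none


if __name__ == "__main__":
    t = BetaTuple(2.0, 2.0)
    print(t.mean())                # prior mean
    print(t.observe(True).mean())  # mean after observing the tuple once
    print(prob_any([BetaTuple(1.0, 1.0), BetaTuple(1.0, 3.0)]))
```

In this sketch a prior of Beta(2, 2) has mean 0.5, and one positive observation moves it to Beta(3, 2) with mean 0.6. Updating from observed query answers rather than from individual tuples is the harder problem the abstract refers to and is outside the scope of this example.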
