Infinite Probabilistic Databases

Probabilistic databases (PDBs) model uncertainty in data quantitatively. In the standard formal framework, PDBs are finite probability spaces over relational database instances. It has been argued convincingly that this framework is compatible neither with an open-world semantics (Ceylan et al., KR 2016) nor with application scenarios that are modeled by continuous probability distributions (Dalvi et al., CACM 2009). We recently introduced a model of PDBs as infinite probability spaces that addresses these issues (Grohe and Lindner, PODS 2019). While that work was mainly concerned with countably infinite probability spaces, our focus here is on uncountable spaces. Such an extension is necessary to model the continuous probability distributions that appear in many applications. However, going beyond countable probability spaces raises nontrivial foundational issues concerning the measurability of events and queries, and ultimately the question of whether queries have a well-defined semantics. It turns out that so-called finite point processes are the appropriate model from probability theory for dealing with probabilistic databases. This model allows us to construct suitable (uncountable) probability spaces of database instances in a systematic way. Our main technical results are measurability statements for relational algebra queries as well as aggregate queries and datalog queries.
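To make the point-process view concrete, the following minimal Python sketch treats a database instance as a finite bag of facts whose size is random and whose attribute values are drawn from a continuous distribution. It is an illustration under assumptions that are not taken from the paper: the single relation with (sensor, temperature) facts, the Poisson instance size, the Gaussian temperatures, and all function names are hypothetical. Query probabilities are estimated by simulation, which only makes sense when the queried event is measurable.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson(rate) count with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_instance(rng, rate=3.0):
    """Sample one instance: a finite bag of (sensor, temperature) facts.

    Assumption for illustration only: the number of facts is Poisson
    distributed and temperatures are Gaussian, so the space of possible
    instances is uncountable.
    """
    size = sample_poisson(rate, rng)
    return [(f"sensor{rng.randrange(5)}", rng.gauss(20.0, 2.0))
            for _ in range(size)]


def select_warm(instance, threshold=21.0):
    """A selection query: keep the facts whose temperature exceeds threshold."""
    return [fact for fact in instance if fact[1] > threshold]


def estimate_probability(event, trials=10000, seed=0):
    """Monte Carlo estimate of the probability that `event` holds."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if event(sample_instance(rng)))
    return hits / trials
```

For example, `estimate_probability(lambda db: len(select_warm(db)) > 0)` estimates the probability that the selection query returns a nonempty answer, and `estimate_probability(lambda db: len(db) == 0)` should come out close to exp(-3), the probability of the empty instance under the assumed Poisson size.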
