Probabilistic databases

Probabilistic databases are databases in which the value of some attributes or the presence of some records is uncertain and known only with some probability. Applications in many areas, such as information extraction, RFID and scientific data management, data cleaning, data integration, and financial risk assessment, produce large volumes of uncertain data, which are best modeled and processed by a probabilistic database. This book presents the state of the art in representation formalisms and query processing techniques for probabilistic data. It starts by discussing the basic principles for representing large probabilistic databases by decomposing them into tuple-independent tables, block-independent-disjoint tables, or U-databases. It then discusses two classes of techniques for query evaluation on probabilistic databases. In extensional query evaluation, the entire probabilistic inference can be pushed into the database engine and is therefore processed as effectively as the evaluation of standard SQL queries. The relational queries that can be evaluated this way are called safe queries. In intensional query evaluation, the probabilistic inference is performed over a propositional formula called the lineage expression: every relational query can be evaluated this way, but the data complexity depends dramatically on the query being evaluated and can be #P-hard. The book also discusses some advanced topics in probabilistic data management, such as top-k query processing, sequential probabilistic databases, indexing and materialized views, and Monte Carlo databases.
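The contrast between the two evaluation styles can be illustrated with a small sketch. It is not taken from the book: the tables `R` and `S`, their probabilities, and the function names are all hypothetical, and the Boolean query Q = exists x, y. R(x) AND S(x, y) was chosen only because it is a standard example of a safe query over tuple-independent tables. The extensional version combines probabilities directly, as a database engine could; the intensional version builds the lineage and evaluates it by enumerating every possible world, which is exponential in the number of tuples and only workable for toy inputs.

```python
from itertools import product
from math import prod

# Hypothetical tuple-independent tables: each tuple maps to its marginal
# probability, and all tuples are assumed to be independent events.
R = {"a": 0.5, "b": 0.8}
S = {("a", 1): 0.4, ("a", 2): 0.5, ("b", 1): 0.1}


def extensional(R, S):
    """Safe plan for Q = exists x, y. R(x) AND S(x, y)."""
    miss = 1.0
    for x, p_r in R.items():
        # Independent project over y: P(at least one S(x, y) is present).
        p_some_s = 1.0 - prod(1.0 - p for (sx, _), p in S.items() if sx == x)
        # Independent join with R(x), then independent project over x.
        miss *= 1.0 - p_r * p_some_s
    return 1.0 - miss


def lineage(R, S):
    """DNF lineage of Q: one clause (r_x AND s_xy) per joining pair."""
    return [(("R", x), ("S", key)) for key in S for x in R if key[0] == x]


def intensional(R, S):
    """Probability of the lineage by brute-force possible-world enumeration."""
    probs = {("R", x): p for x, p in R.items()}
    probs.update({("S", key): p for key, p in S.items()})
    names = list(probs)
    clauses = lineage(R, S)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        # The query holds in this world if some clause is fully satisfied.
        if any(all(world[v] for v in clause) for clause in clauses):
            total += prod(probs[v] if world[v] else 1.0 - probs[v] for v in names)
    return total


if __name__ == "__main__":
    print(extensional(R, S), intensional(R, S))
```

Both functions should agree on this input up to floating-point rounding. The extensional plan touches each tuple once, while the brute-force loop visits 2^5 worlds here and doubles with every added tuple, which is the gap the abstract refers to when it says that intensional evaluation can be #P-hard.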
