Scalable Query Evaluation over Complex Probabilistic Databases

The age of Big Data has brought datasets that are not only large but also far more complex. These datasets are assembled from disparate, unreliable, and noisy sources, often in an ad hoc way, because careful data cleaning and integration is too time-consuming and no longer always necessary. Representing the uncertainty hidden in such datasets is necessary to obtain meaningful query answers, and probabilistic databases have emerged as arguably the most popular solution to this problem. Their application to practical problems, however, has been held back because (i) the common models they use are not rich enough to capture the dependencies in these problems, and (ii) unlike in traditional databases, query evaluation over probabilistic databases can be very expensive and unpredictable. This dissertation addresses these challenges by first proposing a new model for probabilistic databases that is rich enough to capture the dependencies found in most practical applications, while still allowing a translation to considerably simpler and well-studied models. Our model leverages existing models from the AI literature that combine probability theory with logic. The main challenge of query evaluation over probabilistic databases is that it requires solving probabilistic inference, a notoriously hard problem. This dissertation studies this problem through both (i) foundational results that give new theoretical insights into existing probabilistic inference techniques, such as Read-Once Formulas, Tree-Decompositions, Binary Decision Diagrams, and Negation Normal Forms, when applied to the setting of probabilistic databases, which, as we will see, have their own distinct challenges and expectations, and (ii) a robust system in which these ideas are leveraged for efficient and reliable query evaluation.
