Managing large-scale probabilistic databases

Modern applications are driven by data, and increasingly the data driving these applications are imprecise. The set of applications that generate imprecise data is diverse. In sensor database applications, the goal is to measure some aspect of the physical world (such as the temperature in a region or a person's location). Such an application has no choice but to deal with imprecision, as measuring the physical world is inherently imprecise. In data integration, consider two databases that refer to the same set of real-world entities but refer to those entities in slightly different ways. For example, one database may contain an entity ‘J. Smith’ while the second database refers to ‘John Smith’. In such a scenario, the large size of the data makes it too costly to manually reconcile all references in the two databases. To lower the cost of integration, state-of-the-art approaches allow the data to be imprecise. In addition to applications that are forced to cope with imprecision, emerging data-driven applications, such as large-scale information extraction, natively produce and manipulate similarity scores. In all these domains, the current state-of-the-art approach is to allow the data to be imprecise and to shift the burden of coping with imprecision to applications. The thesis of this work is that it is possible to effectively manage large, imprecise databases using a generic approach based on probability theory. The key technical challenge in building such a general-purpose approach is performance, and the technical contributions of this dissertation are techniques for efficient query evaluation over probabilistic databases. In particular, we demonstrate that it is possible to run complex SQL queries on tens of gigabytes of probabilistic data with performance that is comparable to a standard relational database engine.
