Management of probabilistic data: foundations and challenges

Many applications today need to manage large data sets with uncertainties. In this paper we describe the foundations of managing data where the uncertainties are quantified as probabilities. We review the basic definitions of the probabilistic data model, present some fundamental theoretical results for query evaluation on probabilistic databases, and discuss several challenges, open problems, and research directions.
