Management of data with uncertainties

Keywords: probabilistic databases, query processing

Since their invention in the early 1970s, relational databases have been deterministic. They were designed to support applications such as accounting, inventory, customer care, and manufacturing, which require precise semantics. A row is either in the database or it is not; a tuple is either in the query answer or it is not. The foundations of query processing and the tools that exist today for managing data rely fundamentally on the assumption that the data is deterministic.

Increasingly, however, we need to manage data that is uncertain. The uncertainty can be in the data itself, in the schema, in the mapping between different data instances, or in the user query. Large and growing amounts of uncertain data appear in a variety of domains: data integration, scientific data, information extracted automatically from text, and data from the physical world. Large enterprises can sometimes afford to cope with uncertainty in their data by removing it completely, using expensive data cleaning or ETL tools. But organizations and users increasingly need to cope with uncertain data directly, either because cleaning it is prohibitively expensive (e.g. in scientific data integration or in the integration of Web data) or because cleaning it is impossible (e.g. sensor data or RFID data). It becomes clear that we need
