Consensus answers for queries over probabilistic databases

We address the problem of finding a "best" deterministic query answer to a query over a probabilistic database. For this purpose, we propose the notion of a consensus world (or a consensus answer) which is a deterministic world (answer) that minimizes the expected distance to the possible worlds (answers). This problem can be seen as a generalization of the well-studied inconsistent information aggregation problems (e.g. rank aggregation) to probabilistic databases. We consider this problem for various types of queries including SPJ queries, Top-k ranking queries, group-by aggregate queries, and clustering. For different distance metrics, we obtain polynomial time optimal or approximation algorithms for computing the consensus answers (or prove NP-hardness). Most of our results are for a general probabilistic database model, called and/xor tree model, which significantly generalizes previous probabilistic database models like x-tuples and block-independent disjoint models, and is of independent interest.

[1]  Rahul Gupta,et al.  Creating probabilistic databases from information extraction models , 2006, VLDB.

[2]  Shirley Dex,et al.  JR 旅客販売総合システム(マルス)における運用及び管理について , 1991 .

[3]  Graham Cormode,et al.  Approximation algorithms for clustering uncertain data , 2008, PODS.

[4]  Ihab F. Ilyas,et al.  Efficient search for the top-k probable nearest neighbors in uncertain databases , 2008, Proc. VLDB Endow..

[5]  Sunil Prabhakar,et al.  Evaluating probabilistic queries over imprecise data , 2003, SIGMOD '03.

[6]  Dan Suciu,et al.  Management of probabilistic data: foundations and challenges , 2007, PODS '07.

[7]  Xi Zhang,et al.  On the semantics and evaluation of top-k queries in probabilistic databases , 2008, ICDE Workshops.

[8]  Xi Zhang,et al.  Semantics and evaluation of top-k queries in probabilistic databases , 2008, 2008 IEEE 24th International Conference on Data Engineering Workshop.

[9]  Norbert Fuhr,et al.  A probabilistic relational algebra for the integration of information retrieval and database systems , 1997, TOIS.

[10]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[11]  Feifei Li,et al.  Efficient Processing of Top-k Queries in Uncertain Databases with x-Relations , 2008, IEEE Trans. Knowl. Data Eng..

[12]  Moni Naor,et al.  Rank aggregation methods for the Web , 2001, WWW '01.

[13]  Andrew McGregor,et al.  Estimating statistical aggregates on probabilistic data streams , 2008, TODS.

[14]  Nicolas de Condorcet Essai Sur L'Application de L'Analyse a la Probabilite Des Decisions Rendues a la Pluralite Des Voix , 2009 .

[15]  Dan Olteanu,et al.  From complete to incomplete information and back , 2007, SIGMOD '07.

[16]  Mohamed A. Soliman,et al.  Top-k Query Processing in Uncertain Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[17]  Daisy Zhe Wang,et al.  BayesStore: managing large, uncertain data repositories with probabilistic graphical models , 2008, Proc. VLDB Endow..

[18]  S. Shapiro,et al.  Mathematics without Numbers , 1993 .

[19]  C. Dwork,et al.  Rank Aggregation Revisited , 2002 .

[20]  Yoshiko Wakabayashi The Complexity of Computing Medians of Relations , 1998 .

[21]  Graham Cormode,et al.  Histograms and Wavelets on Probabilistic Data , 2010, IEEE Trans. Knowl. Data Eng..

[22]  Laks V. S. Lakshmanan,et al.  ProbView: a flexible probabilistic database system , 1997, TODS.

[23]  Christopher Ré,et al.  Efficient Top-k Query Evaluation on Probabilistic Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[24]  Hector Garcia-Molina,et al.  The Management of Probabilistic Data , 1992, IEEE Trans. Knowl. Data Eng..

[25]  Jian Li,et al.  A unified approach to ranking in probabilistic databases , 2009, The VLDB Journal.

[26]  Jennifer Widom,et al.  Trio: A System for Integrated Management of Data, Accuracy, and Lineage , 2004, CIDR.

[27]  Jian Pei,et al.  Ranking queries on uncertain data: a probabilistic threshold approach , 2008, SIGMOD Conference.

[28]  Feifei Li,et al.  Semantics of Ranking Queries for Probabilistic Data and Expected Ranks , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[29]  P.-C.-F. Daunou,et al.  Mémoire sur les élections au scrutin , 1803 .

[30]  Tomasz Imielinski,et al.  Incomplete Information in Relational Databases , 1984, JACM.

[31]  Val Tannen,et al.  Provenance semirings , 2007, PODS.

[32]  Gösta Grahne Horn tables-an efficient tool for handling incomplete information in databases , 1989, PODS '89.

[33]  Val Tannen,et al.  Models for Incomplete and Probabilistic Information , 2006, IEEE Data Eng. Bull..

[34]  Jian Pei,et al.  Efficiently Answering Probabilistic Threshold Top-k Queries on Uncertain Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[35]  Silvio Micali,et al.  An O(v|v| c |E|) algoithm for finding maximum matching in general graphs , 1980, 21st Annual Symposium on Foundations of Computer Science (sfcs 1980).

[36]  Wei Hong,et al.  Model-Driven Data Acquisition in Sensor Networks , 2004, VLDB.

[37]  Ihab F. Ilyas,et al.  Ranking with Uncertain Scores , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[38]  Nir Ailon,et al.  Aggregation of Partial Rankings, p-Ratings and Top-m Lists , 2007, SODA '07.

[39]  Muhammad H. Alsuwaiyel,et al.  Algorithms - Design Techniques and Analysis , 1999, Lecture Notes Series on Computing.

[40]  R. Stephenson A and V , 1962, The British journal of ophthalmology.

[41]  Renée J. Miller,et al.  Clean Answers over Dirty Databases: A Probabilistic Approach , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[42]  Nir Ailon,et al.  Aggregating inconsistent information: Ranking and clustering , 2008 .

[43]  Prithviraj Sen,et al.  Representing and Querying Correlated Tuples in Probabilistic Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[44]  Christopher Ré,et al.  Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization , 2007, VLDB.

[45]  J. Hodge,et al.  The Mathematics of Voting and Elections: A Hands-On Approach , 2005, Mathematical World.

[46]  Ronald Fagin,et al.  Comparing top k lists , 2003, SODA '03.

[47]  Jennifer Widom,et al.  Working Models for Uncertain Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[48]  Lise Getoor,et al.  Exploiting shared correlations in probabilistic databases , 2008, Proc. VLDB Endow..