Semantics and evaluation of top-k queries in probabilistic databases

We study here fundamental issues involved in top-k query evaluation in probabilistic databases. We consider simple probabilistic databases in which probabilities are associated with individual tuples, and general probabilistic databases in which, additionally, exclusivity relationships between tuples can be represented. In contrast to other recent research in this area, we do not limit ourselves to injective scoring functions. We formulate three intuitive postulates for the semantics of top-k queries in probabilistic databases, and introduce a new semantics, Global-Topk, that satisfies those postulates to a large degree. We also show how to evaluate queries under the Global-Topk semantics. For simple databases we design dynamic-programming based algorithms. For general databases we show polynomial-time reductions to the simple cases, and provide effective heuristics to speed up the computation in practice. For example, we demonstrate that for a fixed k the time complexity of top-k query evaluation is as low as linear, under the assumption that probabilistic databases are simple and scoring functions are injective.

[1]  Christoph Koch,et al.  World-set decompositions: Expressiveness and efficient algorithms , 2007, Theor. Comput. Sci..

[2]  Serge Abiteboul,et al.  Foundations of Databases: The Logical Level , 1995 .

[3]  Wendy Hui Wang,et al.  The Threshold Algorithm: From Middleware Systems to the Relational Engine , 2007, IEEE Transactions on Knowledge and Data Engineering.

[4]  Luis Gravano,et al.  Evaluating top-k queries over Web-accessible databases , 2002, Proceedings 18th International Conference on Data Engineering.

[5]  Joseph Y. Halpern An Analysis of First-Order Logics of Probability , 1989, IJCAI.

[6]  T. S. Jayram,et al.  OLAP over uncertain and imprecise data , 2007, The VLDB Journal.

[7]  Mohamed A. Soliman,et al.  Top-k Query Processing in Uncertain Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[8]  FuhrNorbert,et al.  A probabilistic relational algebra for the integration of information retrieval and database systems , 1997 .

[9]  Norbert Fuhr,et al.  A probabilistic relational algebra for the integration of information retrieval and database systems , 1997, TOIS.

[10]  Walid G. Aref,et al.  Supporting top-kjoin queries in relational databases , 2004, The VLDB Journal.

[11]  Venu Govindaraju,et al.  Biometrics Driven Smart Environments: Abstract Framework and Evaluation , 2008, UIC.

[12]  Esteban Zimányi,et al.  Query Evaluation in Probabilistic Relational Databases , 1997, Theor. Comput. Sci..

[13]  Kevin Chen-Chuan Chang,et al.  Probabilistic top-k and ranking-aggregate queries , 2008, TODS.

[14]  Jennifer Widom,et al.  Trio: A System for Integrated Management of Data, Accuracy, and Lineage , 2004, CIDR.

[15]  Jian Pei,et al.  Ranking queries on uncertain data: a probabilistic threshold approach , 2008, SIGMOD Conference.

[16]  Ronald Fagin,et al.  Combining Fuzzy Information from Multiple Systems , 1999, J. Comput. Syst. Sci..

[17]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[18]  Dan Olteanu,et al.  World-Set Decompositions: Expressiveness and Efficient Algorithms , 2007, ICDT.

[19]  Feifei Li,et al.  Efficient Processing of Top-k Queries in Uncertain Databases with x-Relations , 2008, IEEE Transactions on Knowledge and Data Engineering.

[20]  Sudipto Guha,et al.  Merging the Results of Approximate Match Operations , 2004, VLDB.

[21]  Moni Naor,et al.  Optimal aggregation algorithms for middleware , 2001, PODS '01.

[22]  Jennifer Widom,et al.  ULDBs: databases with uncertainty and lineage , 2006, VLDB.

[23]  Michael Pittarelli,et al.  The Theory of Probabilistic Databases , 1987, VLDB.

[24]  Christopher Ré,et al.  Efficient Top-k Query Evaluation on Probabilistic Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[25]  John R. Smith,et al.  Supporting Incremental Join Queries on Ranked Inputs , 2001, VLDB.

[26]  Laks V. S. Lakshmanan,et al.  ProbView: a flexible probabilistic database system , 1997, TODS.

[27]  Xi Zhang,et al.  Semantics and evaluation of top-k queries in probabilistic databases , 2008, 2008 IEEE 24th International Conference on Data Engineering Workshop.

[28]  Walid G. Aref,et al.  Joining Ranked Inputs in Practice , 2002, VLDB.

[29]  Feifei Li,et al.  Semantics of Ranking Queries for Probabilistic Data and Expected Ranks , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[30]  Tomasz Imielinski,et al.  Incomplete Information in Relational Databases , 1984, JACM.