Efficient Processing of Top-k Queries in Uncertain Databases with x-Relations

This work introduces novel polynomial algorithms for processing top-k queries in uncertain databases under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly instantiates into one tuple from one or more alternatives. Our results significantly improve the best known algorithms for top-k query processing in uncertain databases, in terms of both runtime and memory usage. In the single-alternative case, the new algorithms are 2 to 3 orders of magnitude faster than the previous algorithms. In the multialternative case, we introduce the first-known polynomial algorithms, while the current best algorithms have exponential complexity in both time and space. Our algorithms run in near linear or low polynomial time and cover both types of top-k queries in uncertain databases. We provide both the theoretical analysis and an extensive experimental evaluation to demonstrate the superiority of the new approaches over existing solutions.

[1]  Feifei Li,et al.  Efficient Processing of Top-k Queries in Uncertain Databases with x-Relations , 2008, IEEE Transactions on Knowledge and Data Engineering.

[2]  Tomasz Imielinski,et al.  Incomplete Information in Relational Databases , 1984, JACM.

[3]  Rajeev Motwani,et al.  Robust and efficient fuzzy match for online data cleaning , 2003, SIGMOD '03.

[4]  Prithviraj Sen,et al.  Representing and Querying Correlated Tuples in Probabilistic Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[5]  Yufei Tao,et al.  Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions , 2005, VLDB.

[6]  Serge Abiteboul,et al.  On the representation and querying of sets of possible worlds , 1987, SIGMOD '87.

[7]  Jennifer Widom,et al.  Working Models for Uncertain Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[8]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[9]  Dan Olteanu,et al.  From complete to incomplete information and back , 2007, SIGMOD '07.

[10]  Mohamed A. Soliman,et al.  Top-k Query Processing in Uncertain Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[11]  Salvatore J. Stolfo,et al.  Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem , 1998, Data Mining and Knowledge Discovery.

[12]  Laks V. S. Lakshmanan,et al.  ProbView: a flexible probabilistic database system , 1997, TODS.

[13]  Hector Garcia-Molina,et al.  The Management of Probabilistic Data , 1992, IEEE Trans. Knowl. Data Eng..

[14]  Dan Olteanu,et al.  10106 Worlds and Beyond: Efficient Representation and Processing of Incomplete Information , 2007, ICDE.

[15]  Parag Agrawal,et al.  Trio: a system for data, uncertainty, and lineage , 2006, VLDB.

[16]  Ambuj K. Singh,et al.  APLA: Indexing Arbitrary Probability Distributions , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[17]  Sunil Prabhakar,et al.  Evaluating probabilistic queries over imprecise data , 2003, SIGMOD '03.

[18]  Renée J. Miller,et al.  ConQuer: efficient management of inconsistent databases , 2005, SIGMOD '05.

[19]  Jennifer Widom,et al.  ULDBs: databases with uncertainty and lineage , 2006, VLDB.

[20]  Joann J. Ordille,et al.  Data integration: the teenage years , 2006, VLDB.

[21]  Kevin Chen-Chuan Chang,et al.  RankSQL: query algebra and optimization for relational top-k queries , 2005, SIGMOD '05.

[22]  Susanne E. Hambrusch,et al.  Indexing Uncertain Categorical Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[23]  Wei Hong,et al.  Model-Driven Data Acquisition in Sensor Networks , 2004, VLDB.

[24]  Christopher Ré,et al.  Efficient Top-k Query Evaluation on Probabilistic Data , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[25]  Dennis Shasha,et al.  Declarative Data Cleaning: Language, Model, and Algorithms , 2001, VLDB.

[26]  Jian Pei,et al.  Ranking queries on uncertain data: a probabilistic threshold approach , 2008, SIGMOD Conference.

[27]  Norbert Fuhr,et al.  A Probabilistic Framework for Vague Queries and Imprecise Information in Databases , 1990, VLDB.