k-Selection Query over Uncertain Data

This paper studies a new query on uncertain data, called k-selection query. Given an uncertain dataset of N objects, where each object is associated with a preference score and a presence probability, a k-selection query returns k objects such that the expected score of the “best available” objects is maximized. This query is useful in many application domains such as entity web search and decision making. In evaluating k-selection queries, we need to compute the expected best score (EBS) for candidate k-selection sets and search for the optimal selection set with the highest EBS. Those operations are costly due to the extremely large search space. In this paper, we identify several important properties of k-selection queries, including EBS decomposition, query recursion, and EBS bounding. Based upon these properties, we first present a dynamic programming (DP) algorithm that answers the query in O(k·N) time. Further, we propose a Bounding-and-Pruning (BP) algorithm, that exploits effective search space pruning strategies to find the optimal selection without accessing all objects. We evaluate the DP and BP algorithms using both synthetic and real data. The results show that the proposed algorithms outperform the baseline approach by several orders of magnitude.

[1]  Jeffrey Xu Yu,et al.  Sliding-window top-k queries on uncertain streams , 2008, The VLDB Journal.

[2]  Bin Jiang,et al.  Probabilistic Skylines on Uncertain Data , 2007, VLDB.

[3]  Mohamed A. Soliman,et al.  Top-k Query Processing in Uncertain Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[4]  Ihab F. Ilyas,et al.  Efficient search for the top-k probable nearest neighbors in uncertain databases , 2008, Proc. VLDB Endow..

[5]  Jian Pei,et al.  Probabilistic Reverse Nearest Neighbor Queries on Uncertain Data , 2010, IEEE Transactions on Knowledge and Data Engineering.

[6]  Jennifer Widom,et al.  Working Models for Uncertain Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[7]  Sunil Prabhakar,et al.  Querying imprecise data in moving object environments , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[8]  Jeffrey Xu Yu,et al.  Probabilistic Skyline Operator over Sliding Windows , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[9]  Hans-Peter Kriegel,et al.  Probabilistic Nearest-Neighbor Query on Uncertain Objects , 2007, DASFAA.

[10]  Chi-Yin Chow,et al.  Probabilistic Verifiers: Evaluating Constrained Nearest-Neighbor Queries over Uncertain Data , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[11]  Norbert Fuhr,et al.  A probabilistic relational algebra for the integration of information retrieval and database systems , 1997, TOIS.

[12]  Dan Suciu,et al.  Efficient query evaluation on probabilistic databases , 2004, The VLDB Journal.

[13]  Jian Pei,et al.  Ranking queries on uncertain data: a probabilistic threshold approach , 2008, SIGMOD Conference.

[14]  Laks V. S. Lakshmanan,et al.  ProbView: a flexible probabilistic database system , 1997, TODS.

[15]  Feifei Li,et al.  Semantics of Ranking Queries for Probabilistic Data and Expected Ranks , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[16]  Prithviraj Sen,et al.  Representing and Querying Correlated Tuples in Probabilistic Databases , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[17]  Reynold Cheng,et al.  Evaluating probability threshold k-nearest-neighbor queries over uncertain data , 2009, EDBT '09.

[18]  Serge Abiteboul,et al.  On the Representation and Querying of Sets of Possible Worlds , 1991, Theor. Comput. Sci..