A Framework for Ranking and KNN Queries in a Probabilistic Skyline Model

Skyline computation has gained a lot of attention in recent years. According to the definition of skyline, objects that belong to skyline cannot be ranked among themselves because they are incomparable. This constraint limits the application of skyline. Fortunately, due to the recently proposed probabilistic skyline model, skyline objects which contain multiple elements, can now be compared with each others. Different from the traditional skyline model where each object can either be a skyline object or not, in the probabilistic skyline model, each object is assigned a skyline probability to denote its likelihood of being a skyline object. Under this model, two simple but important questions will naturally be asked: (1) Given an object, which of the objects are the K nearest neighbors to it based on their skyline probabilities? (2) Given an object, what is the ranking of the objects which have skyline probabilities greater than the given object? To the best of our knowledge, no existing work can effectively answer these two questions. Yet, answering them is not trivial. For a medium-size dataset (e.g. 10,000 objects), it may take more than an hour to compute the skyline probabilities of all objects. In this paper, we propose a novel framework to answering the above two questions on the fly efficiently. Our proposed work is based on the idea of bounding-pruning-refining strategy. We first compute the skyline probabilities of the target object and all its elements. For the rest of the objects, instead of computing their accurate skyline probabilities, we compute the upper bound and lower bound skyline probabilities using the elements of the target object. Based on lower bound and upper bound of their skyline probabilities, some objects, which cannot be in the result, will be pruned. For those objects, which we are unknown whether they are in the results or not, we need to refine their bounds. The refinement strategy is based on the idea of space partition. Specifically, we first partition the whole dataspace into several subspaces based on the distribution of elements in the target object. When we iteratively do the the refinement of the bounds, we will do the partitioning strategy in each subspace. In order to implement this framework, a novel tree, called Space Partition Tree (SPTree) is proposed to index the objects and their elements. We evaluate our proposed work using three synthetic datasets and one real-life dataset. We report all our findings in this paper.

[1]  Nick Roussopoulos,et al.  Nearest neighbor queries , 1995, SIGMOD '95.

[2]  Qing Liu,et al.  Efficient Computation of the Skyline Cube , 2005, VLDB.

[3]  Jennifer Widom,et al.  Working Models for Uncertain Data , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[4]  Jennifer Widom,et al.  ULDBs: databases with uncertainty and lineage , 2006, VLDB.

[5]  Xiang Lian,et al.  Dynamic skyline queries in metric spaces , 2008, EDBT '08.

[6]  Jan Chomicki,et al.  Skyline with presorting , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[7]  Z. Meral Özsoyoglu,et al.  Distance-based indexing for high-dimensional metric spaces , 1997, SIGMOD '97.

[8]  Jeffrey Scott Vitter,et al.  Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data , 2004, VLDB.

[9]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[10]  Bernhard Seeger,et al.  Efficient Computation of Reverse Skyline Queries , 2007, VLDB.

[11]  Beng Chin Ooi,et al.  Skyline Queries Against Mobile Lightweight Devices in MANETs , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[12]  Jian Pei,et al.  SUBSKY: Efficient Computation of Skylines in Subspaces , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[13]  Donald Kossmann,et al.  The Skyline operator , 2001, Proceedings 17th International Conference on Data Engineering.

[14]  Bin Jiang,et al.  Probabilistic Skylines on Uncertain Data , 2007, VLDB.

[15]  Kyriakos Mouratidis,et al.  Continuous monitoring of top-k queries over sliding windows , 2006, SIGMOD Conference.

[16]  Yufei Tao,et al.  Maintaining sliding window skylines on data streams , 2006, IEEE Transactions on Knowledge and Data Engineering.

[17]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[18]  Masatoshi Yoshikawa,et al.  The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation , 2000, VLDB.

[19]  Tian Xia,et al.  Refreshing the sky: the compressed skycube with efficient support for frequent updates , 2006, SIGMOD Conference.

[20]  Donald Kossmann,et al.  Shooting Stars in the Sky: An Online Algorithm for Skyline Queries , 2002, VLDB.

[21]  Xuemin Lin,et al.  Selecting Stars: The k Most Representative Skyline Operator , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[22]  Beng Chin Ooi,et al.  Indexing the Distance: An Efficient Method to KNN Processing , 2001, VLDB.

[23]  Hongjun Lu,et al.  Stabbing the sky: efficient skyline computation over sliding windows , 2005, 21st International Conference on Data Engineering (ICDE'05).

[24]  Bernhard Seeger,et al.  An optimal and progressive algorithm for skyline queries , 2003, SIGMOD '03.

[25]  Beng Chin Ooi,et al.  Efficient Progressive Skyline Computation , 2001, VLDB.

[26]  Anthony K. H. Tung,et al.  Finding k-dominant skylines in high dimensional space , 2006, SIGMOD Conference.

[27]  Heng Tao Shen,et al.  Multi-source Skyline Query Processing in Road Networks , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[28]  Jian Pei,et al.  Catching the Best Views of Skyline: A Semantic Approach Based on Decisive Subspaces , 2005, VLDB.

[29]  Yufei Tao,et al.  Indexing Multi-Dimensional Uncertain Data with Arbitrary Probability Density Functions , 2005, VLDB.