Probabilistic skylines on uncertain data: model and bounding-pruning-refining methods

Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on them, how to conduct advanced analysis on uncertain data remains largely an open problem. In this paper, we tackle the problem of skyline analysis on uncertain data. We propose a novel probabilistic skyline model in which each uncertain object has some probability of being in the skyline, and a p-skyline contains all objects whose skyline probabilities are at least p (0 < p ≤ 1). Computing probabilistic skylines on large uncertain data sets is challenging. We develop a bounding-pruning-refining framework and three algorithms within it. The bottom-up algorithm computes the skyline probabilities of selected instances of uncertain objects and uses those instances to prune other instances and uncertain objects effectively. The top-down algorithm recursively partitions the instances of uncertain objects into subsets and prunes subsets and objects aggressively. Combining the advantages of the bottom-up and top-down algorithms, we develop a hybrid algorithm that further improves performance. Our experimental results on both a real NBA player data set and benchmark synthetic data sets show that probabilistic skylines are interesting and useful, and that our algorithms are efficient on large data sets.
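
To make the p-skyline definition concrete, the following is a minimal brute-force sketch, not any of the three algorithms named above. It rests on assumptions that the abstract does not state: each uncertain object is a finite list of equally likely instances, objects are independent of one another, and smaller values are preferred in every dimension. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere.

    Assumption: smaller values are preferred in every dimension.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def instance_skyline_probability(
    u: Point, owner: str, objects: Dict[str, List[Point]]
) -> float:
    """Probability that instance u is dominated by no other object.

    Assumption: the instances of an object are equally likely and objects
    are independent, so the per-object probabilities multiply.
    """
    prob = 1.0
    for name, instances in objects.items():
        if name == owner:
            continue
        dominating = sum(1 for v in instances if dominates(v, u))
        prob *= 1.0 - dominating / len(instances)
    return prob


def object_skyline_probability(owner: str, objects: Dict[str, List[Point]]) -> float:
    """Average of the skyline probabilities of the object's instances."""
    instances = objects[owner]
    total = sum(instance_skyline_probability(u, owner, objects) for u in instances)
    return total / len(instances)


def p_skyline(objects: Dict[str, List[Point]], p: float) -> List[str]:
    """Names of all objects whose skyline probability is at least p."""
    if not 0 < p <= 1:
        raise ValueError("p must satisfy 0 < p <= 1")
    return [name for name in objects if object_skyline_probability(name, objects) >= p]


if __name__ == "__main__":
    # Toy data invented for illustration only.
    toy = {
        "A": [(1.0, 1.0), (2.0, 2.0)],
        "B": [(1.5, 1.5), (3.0, 3.0)],
        "C": [(4.0, 4.0)],
    }
    for name in toy:
        print(name, object_skyline_probability(name, toy))
    print(p_skyline(toy, 0.5))
```

This sketch compares every instance against every instance of every other object, so its cost grows quadratically with the total number of instances. That cost is what bounding and pruning aim to avoid on large data sets.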
