Probabilistic aggregate skyline join queries: skylines with aggregate operations over existentially uncertain relations

Multi-criteria decision making, made possible by the advent of skyline queries, has been applied successfully in many areas. Although most earlier work considers only a single relation, several real-world applications require finding the skyline set over multiple relations. Consequently, join operations over skylines have been proposed, where the preferences are local to each relation and/or defined on aggregated values of attributes from different relations. Meanwhile, uncertain datasets are finding increasing application in many scientific and real-life settings. Skyline computation over such datasets is even more challenging because every object can be a skyline object with some probability. In this paper, we introduce probabilistic aggregate skyline join queries (PASJQ), which ask for objects whose probability of being a skyline object in the join of two uncertain relations exceeds a query probability threshold. The skyline preferences are on both local and aggregate attributes. Since the naïve algorithm can be impractical, we propose three algorithms to process such queries efficiently. The algorithms process the skylines locally as much as possible before computing the join, which reduces the computational burden of finding skylines from the larger joined relation. Experiments with real and synthetic data demonstrate the practicality and scalability of these algorithms with respect to the query probability threshold, cardinality, dimensionality, and other parameters of the uncertain relations.
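To make the query semantics concrete, the following is a minimal sketch of a brute-force baseline that enumerates possible worlds. It is not one of the three proposed algorithms, and it rests on several assumptions not stated above: base tuples exist independently of one another, the join is an equi-join on a hypothetical `key` field, smaller attribute values are preferred, and the aggregate is a sum of one attribute per relation. The function and field names (`pasjq_naive`, `local`, `agg`, `prob`) are illustrative only.

```python
from itertools import product


def dominates(a, b):
    """True if a is no worse than b in every dimension and strictly better in one.

    Assumes smaller values are preferred in every dimension.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def join_vector(u, v):
    """Skyline vector of a joined tuple: both local attribute tuples plus the summed aggregate."""
    return u["local"] + v["local"] + (u["agg"] + v["agg"],)


def pasjq_naive(r1, r2, threshold):
    """Return {(id1, id2): skyline probability} for joined tuples above the threshold.

    Enumerates every possible world, so the cost is exponential in the number
    of base tuples; this is only usable on tiny inputs.
    """
    base = r1 + r2
    sky_prob = {}
    for world in product([False, True], repeat=len(base)):
        # Probability of this world under the independence assumption.
        p = 1.0
        for t, present in zip(base, world):
            p *= t["prob"] if present else 1.0 - t["prob"]
        if p == 0.0:
            continue
        alive1 = [t for t, present in zip(r1, world[:len(r1)]) if present]
        alive2 = [t for t, present in zip(r2, world[len(r1):]) if present]
        # Equi-join of the tuples that exist in this world.
        joined = {
            (u["id"], v["id"]): join_vector(u, v)
            for u in alive1
            for v in alive2
            if u["key"] == v["key"]
        }
        # A joined tuple is a skyline object in this world if nothing dominates it.
        for pair, vec in joined.items():
            if not any(dominates(other, vec) for other in joined.values()):
                sky_prob[pair] = sky_prob.get(pair, 0.0) + p
    return {pair: pr for pair, pr in sky_prob.items() if pr > threshold}
```

Enumerating worlds sidesteps the correlation between joined tuples that share a base tuple, at the price of exponential cost, which is one way to see why a naïve approach becomes impractical as the relations grow.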
