Efficient Computation of Subspace Skyline over Categorical Domains

Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed the way we search for accommodation, restaurants, and similar services. The underlying datasets in such applications have numerous attributes that are mostly Boolean or categorical. Discovering the skyline of such datasets over a subset of attributes identifies entries that stand out and enables numerous applications. Only a few algorithms are designed to compute the skyline over categorical attributes, and they are applicable only when the number of attributes is small. In this paper, we place the problem of skyline discovery over categorical attributes into perspective and design efficient algorithms for two cases. (i) In the absence of indices, we propose two algorithms, ST-S and ST-P, that exploit the categorical characteristics of the datasets by organizing tuples in a tree data structure that supports efficient dominance tests over the candidate set. (ii) We then consider the existence of widely used precomputed sorted lists. After discussing several approaches and studying their limitations, we propose TA-SKY, a novel threshold-style algorithm that utilizes sorted lists. We further optimize TA-SKY and explore its progressive nature, making it suitable for applications with strict interactive requirements. In addition to an extensive theoretical analysis of the proposed algorithms, we conduct a comprehensive experimental evaluation on a combination of real datasets (including the entire AirBnB data collection) and synthetic datasets to study the practicality of the proposed algorithms. The results show that our techniques outperform applicable approaches by orders of magnitude.
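
To make the notions of dominance and subspace skyline used above concrete, the following is a minimal sketch of a naive baseline. It is not ST-S, ST-P, or TA-SKY; it only illustrates the problem those algorithms solve. The function names, the sample listings, and the convention that a larger value is preferred on every attribute are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Row = Tuple[int, ...]  # one tuple of Boolean/categorical values (assumed: larger is better)


def dominates(a: Row, b: Row, subspace: Sequence[int]) -> bool:
    """Return True if `a` dominates `b` on the attribute indices in `subspace`.

    `a` dominates `b` when it is at least as good on every selected
    attribute and strictly better on at least one of them.
    """
    at_least_as_good = all(a[i] >= b[i] for i in subspace)
    strictly_better = any(a[i] > b[i] for i in subspace)
    return at_least_as_good and strictly_better


def subspace_skyline(rows: Sequence[Row], subspace: Sequence[int]) -> List[Row]:
    """Naive quadratic skyline: keep every row not dominated by another row."""
    skyline: List[Row] = []
    for row in rows:
        # Skip the row if a current candidate already dominates it.
        if any(dominates(s, row, subspace) for s in skyline):
            continue
        # Otherwise drop the candidates that the new row dominates, then keep it.
        skyline = [s for s in skyline if not dominates(row, s, subspace)]
        skyline.append(row)
    return skyline


# Hypothetical listings with Boolean attributes (wifi, parking, kitchen, pool).
listings: List[Row] = [
    (1, 0, 1, 0),
    (1, 1, 0, 0),
    (0, 0, 1, 0),
    (1, 1, 1, 0),
    (0, 1, 0, 1),
]

wifi_and_pool = subspace_skyline(listings, subspace=(0, 3))
wifi_parking_kitchen = subspace_skyline(listings, subspace=(0, 1, 2))
```

In this sketch the skyline depends on the chosen subspace: over (wifi, pool) several listings are mutually incomparable and all remain, while over (wifi, parking, kitchen) a single listing dominates the rest. The pairwise comparison against every candidate is the cost that a tree-organized candidate set or sorted-list access, as described in the abstract, aims to reduce.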
