SkyDiver: a framework for skyline diversification

Skyline queries have attracted considerable attention from the database community during the last decade, due to their applicability in a range of domains. However, most existing works tackle the problem from an efficiency standpoint, i.e., returning the skyline as quickly as possible. The user is then presented with the entire skyline set, which in many cases is overwhelming and requires manual inspection to identify the most informative data points. To overcome this shortcoming, we propose a novel approach for selecting the k most diverse skyline points, i.e., the ones that best capture the different aspects of both the skyline and the dataset they belong to. We present a novel formulation of diversification which, in contrast to previous proposals, is intuitive because it is based solely on the domination relationships among points. Consequently, additional artificial distance measures (e.g., Lp norms) among skyline points are not required. We present efficient approaches for solving this problem and demonstrate their efficiency and effectiveness through an extensive experimental evaluation with both real-life and synthetic data sets.
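To make the idea of domination-based diversification concrete, the following is a minimal Python sketch, not the method evaluated in the paper. It rests on several assumptions that the abstract does not state: smaller values are preferred in every dimension, the dissimilarity of two skyline points is taken to be the Jaccard distance between the sets of points they dominate, and the k points are chosen with a simple greedy max-min heuristic. All function names are hypothetical.

```python
def dominates(p, q):
    """True if p is no worse than q everywhere and strictly better somewhere.

    Assumption: smaller values are preferred in every dimension.
    """
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Naive O(n^2) skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def dominated_set(p, points):
    """Indices of the dataset points that p dominates."""
    return frozenset(i for i, q in enumerate(points) if dominates(p, q))


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def diverse_skyline(points, k):
    """Greedy max-min selection of up to k skyline points (illustrative heuristic).

    Points must be hashable, e.g. tuples. Dissimilarity depends only on
    domination relationships; no Lp distance between points is computed.
    """
    sky = list(dict.fromkeys(skyline(points)))  # drop duplicate skyline points
    if not sky or k <= 0:
        return []
    dom = {p: dominated_set(p, points) for p in sky}
    # Seed with the skyline point that dominates the most dataset points.
    chosen = [max(sky, key=lambda p: len(dom[p]))]
    while len(chosen) < min(k, len(sky)):
        rest = [p for p in sky if p not in chosen]
        # Pick the candidate whose closest already-chosen point is farthest away.
        nxt = max(rest, key=lambda p: min(jaccard_distance(dom[p], dom[c]) for c in chosen))
        chosen.append(nxt)
    return chosen


if __name__ == "__main__":
    data = [(1, 9), (2, 8), (5, 5), (9, 1), (3, 9), (6, 6), (9, 2), (10, 10)]
    print(skyline(data))
    print(diverse_skyline(data, 3))
```

In this toy dataset, (1, 9) and (2, 8) dominate exactly the same points, so under the assumed measure they are redundant and only one of them is selected, even though no geometric distance between the two is ever computed.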
