Selecting Representative Objects Considering Coverage and Diversity

We say that an object o attracts a user u if o is among the top-k objects under the preference function defined by u. Given a set of objects (e.g., restaurants) and a set of users, we study the problem of computing a set of representative objects under two criteria: coverage and diversity. The coverage of a set S of objects is the number of distinct users attracted by the objects in S. Although a set of objects with high coverage attracts many users, all of these users may have quite similar preferences. Such a set may therefore appeal only to one class of users with similar preference functions and disappoint users whose preferences differ widely. The diversity criterion addresses this issue by selecting a set S such that the set of users attracted by each object in S is as different as possible from the sets of users attracted by the other objects in S. Existing work on representative objects considers only one of the two criteria. We are the first to consider both, with a parameter controlling the relative importance of each. Our algorithm has two phases. In the first phase, we prune the objects that cannot be among the representative objects and compute the set of attracted users (also called the reverse top-k) for each remaining object. In the second phase, the reverse top-k sets of these objects are used to compute the representative objects that maximize coverage and diversity. Since this problem is NP-hard, the second phase employs a greedy algorithm. For time and space efficiency, we adopt MinHash and KMV synopses to support the set operations. We prove that the proposed greedy algorithm is ϵ-approximate. Our extensive experimental study on real and synthetic data sets demonstrates the effectiveness of the proposed techniques.
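The two-phase idea can be illustrated with a small brute-force sketch. It is not the paper's algorithm: the linear scoring function, the use of average pairwise Jaccard distance as the diversity measure, the weighted-sum objective with parameter `lam`, and all function names are assumptions made for illustration. The sketch also works on exact sets rather than MinHash or KMV synopses, and its only pruning is to discard objects that attract no user.

```python
from itertools import combinations


def reverse_top_k(objects, users, k):
    """Phase 1 (brute force): map each object to the users it attracts."""
    rtopk = {oid: set() for oid in objects}
    for uid, weights in users.items():
        # Assumption: a user's preference is a linear score, higher is better.
        def score(oid):
            return sum(w * a for w, a in zip(weights, objects[oid]))
        ranked = sorted(objects, key=lambda oid: (-score(oid), oid))
        for oid in ranked[:k]:
            rtopk[oid].add(uid)
    return rtopk


def coverage(selected, rtopk):
    """Number of distinct users attracted by the selected objects."""
    return len(set().union(*(rtopk[o] for o in selected)))


def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def diversity(selected, rtopk):
    """Assumed measure: average pairwise Jaccard distance of attracted-user sets."""
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(rtopk[a], rtopk[b]) for a, b in pairs) / len(pairs)


def greedy_select(rtopk, n_users, t, lam):
    """Phase 2: greedily pick t objects; lam weights coverage against diversity."""
    selected = []
    # Simplistic pruning: an object that attracts nobody cannot help coverage.
    candidates = {o for o, attracted in rtopk.items() if attracted}
    while candidates and len(selected) < t:
        def objective(o):
            trial = selected + [o]
            return (lam * coverage(trial, rtopk) / n_users
                    + (1 - lam) * diversity(trial, rtopk))
        # Sorting the candidates makes tie-breaking deterministic.
        best = max(sorted(candidates), key=objective)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy data: two attributes per object, two weights per user.
objects = {"a": (0.9, 0.1), "b": (0.8, 0.2), "c": (0.1, 0.9), "d": (0.5, 0.5)}
users = {"u1": (1.0, 0.0), "u2": (0.9, 0.1), "u3": (0.0, 1.0), "u4": (0.1, 0.9)}

rtopk = reverse_top_k(objects, users, k=1)
representatives = greedy_select(rtopk, n_users=len(users), t=2, lam=0.5)
```

On this toy input the two selected objects attract disjoint groups of users, so the pair scores well on both criteria. Replacing the exact sets with sketches would trade some accuracy in the coverage and Jaccard estimates for lower time and memory cost.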
