Composable core-sets for diversity and coverage maximization

In this paper we consider efficient construction of "composable core-sets" for basic diversity and coverage maximization problems. A core-set for a point-set in a metric space is a subset of the point-set with the property that an approximate solution for the whole point-set can be obtained from the core-set alone. A composable core-set has the property that, for a collection of point-sets, an approximate solution for the union of the sets in the collection can be obtained from the union of their composable core-sets. Using composable core-sets one can obtain efficient solutions for a wide variety of massive data processing applications, including nearest neighbor search, streaming algorithms and map-reduce computation. Our main results are algorithms for constructing composable core-sets for several notions of "diversity objective functions", a topic that has attracted a significant amount of research over the last few years. The composable core-sets we construct are small and accurate: their approximation factor almost matches that of the best "off-line" algorithms for the relevant optimization problems (up to a constant factor). We also show applications of our results to diverse nearest neighbor search, streaming algorithms and map-reduce computation. Finally, we show that for an alternative notion of diversity maximization based on the maximum coverage problem, small composable core-sets do not exist.
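
To make the composition property concrete, the following is a minimal Python sketch, not the construction analyzed in the paper. It assumes one particular diversity objective (maximize the minimum pairwise distance among k selected points) and uses farthest-point greedy selection both as the per-partition core-set routine and as the final solver on the union. The function names `greedy_farthest`, `min_pairwise_distance` and `solve_via_coresets` are hypothetical, and no approximation factor is claimed for this sketch.

```python
from itertools import combinations
from math import dist


def greedy_farthest(points, k):
    """Select up to k points by farthest-point traversal (greedy heuristic)."""
    if not points or k <= 0:
        return []
    chosen = [points[0]]
    # nearest[i] = distance from points[i] to its closest selected point
    nearest = [dist(p, chosen[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        # Take the point that is farthest from everything selected so far.
        i = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(points[i])
        nearest = [min(nearest[j], dist(points[j], points[i]))
                   for j in range(len(points))]
    return chosen


def min_pairwise_distance(subset):
    """Diversity objective assumed here: the smallest pairwise distance."""
    return min(dist(p, q) for p, q in combinations(subset, 2))


def solve_via_coresets(partitions, k):
    """Summarize each partition independently, then solve on the union."""
    union = [p for part in partitions for p in greedy_farthest(part, k)]
    return greedy_farthest(union, k)


if __name__ == "__main__":
    # Three partitions, e.g. held by three different machines.
    partitions = [
        [(0, 0), (1, 0), (0, 1)],
        [(10, 0), (10, 1), (9, 0)],
        [(0, 10), (1, 10), (0, 9)],
    ]
    result = solve_via_coresets(partitions, k=2)
    print(result, min_pairwise_distance(result))
```

Each partition is summarized without looking at the others, which is what allows the summaries to be computed by separate mappers or over separate chunks of a stream; only the union of the k-point summaries reaches the final step.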
