MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension

Given a dataset of points in a metric space and an integer $k$, a diversity maximization problem requires determining a subset of $k$ points maximizing some diversity objective measure, e.g., the minimum or the average distance between two points in the subset. Diversity maximization is computationally hard, hence only approximate solutions can be hoped for. Although its applications are mainly in massive data analysis, most past research on diversity maximization has focused on the sequential setting. In this work we present space- and pass/round-efficient diversity maximization algorithms for the Streaming and MapReduce models and analyze their approximation guarantees for the relevant class of metric spaces of bounded doubling dimension. Like other approaches in the literature, our algorithms rely on the determination of high-quality core-sets, i.e., (much) smaller subsets of the input which contain good approximations to the optimal solution for the whole input. For a variety of diversity objective functions, our algorithms attain an $(\alpha+\epsilon)$-approximation ratio, for any constant $\epsilon>0$, where $\alpha$ is the best approximation ratio achieved by a polynomial-time, linear-space sequential algorithm for the same diversity objective. This improves substantially over the approximation ratios attainable in Streaming and MapReduce by state-of-the-art algorithms for general metric spaces. We provide extensive experimental evidence of the effectiveness of our algorithms on both real-world and synthetic datasets, scaling up to over a billion points.
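The two example objectives named in the abstract, and the general idea of solving the problem on a small core-set rather than on the whole input, can be illustrated with a short Python sketch. It is not the algorithm of the paper: the Euclidean metric, the greedy farthest-first traversal used to build each core-set, and every name below (`min_pairwise_distance`, `avg_pairwise_distance`, `farthest_first`, `two_round_diversity`, `coreset_size`, `num_partitions`) are assumptions made for illustration only.

```python
import itertools
import math
import random


def min_pairwise_distance(points):
    """Diversity measured as the minimum distance between two points."""
    return min(math.dist(p, q) for p, q in itertools.combinations(points, 2))


def avg_pairwise_distance(points):
    """Diversity measured as the average distance between two points."""
    pairs = list(itertools.combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def farthest_first(points, size):
    """Greedy farthest-first traversal (an assumed core-set heuristic):
    start from the first point, then repeatedly add the point that is
    farthest from the points selected so far."""
    selected = [points[0]]
    # dist_to_sel[i] = distance from points[i] to its closest selected point
    dist_to_sel = [math.dist(p, points[0]) for p in points]
    while len(selected) < min(size, len(points)):
        i = max(range(len(points)), key=dist_to_sel.__getitem__)
        selected.append(points[i])
        dist_to_sel = [min(d, math.dist(p, points[i]))
                       for d, p in zip(dist_to_sel, points)]
    return selected


def two_round_diversity(points, k, coreset_size, num_partitions):
    """Hypothetical two-round scheme in the MapReduce spirit:
    round 1 builds a small core-set per partition, round 2 solves
    the problem sequentially on the union of the core-sets."""
    parts = [points[i::num_partitions] for i in range(num_partitions)]
    coreset = [p for part in parts if part
               for p in farthest_first(part, coreset_size)]
    return farthest_first(coreset, k)


if __name__ == "__main__":
    random.seed(0)
    data = [(random.random(), random.random()) for _ in range(1000)]
    solution = two_round_diversity(data, k=5, coreset_size=20, num_partitions=4)
    print(min_pairwise_distance(solution), avg_pairwise_distance(solution))
```

In this sketch the second round only sees `coreset_size * num_partitions` points, which is the sense in which a core-set is a (much) smaller stand-in for the input. Choosing `coreset_size` larger than `k` is a design choice of the sketch, not a parameter setting taken from the paper.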

[41]  Eli Upfal,et al.  MapReduce and streaming algorithms for diversity maximization in metric spaces of bounded doubling dimension , 2017, VLDB 2017.
