Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially

Center-based clustering is a fundamental primitive for data analysis and becomes very challenging for large datasets. In this paper, we focus on the popular $k$-center variant which, given a set $S$ of points from some metric space and a parameter $k < |S|$, asks for a subset of $k$ centers in $S$ minimizing the maximum distance of any point of $S$ from its closest center. [...] For any fixed $\epsilon > 0$, the algorithms yield solutions whose approximation ratios are a mere additive term $\epsilon$ away from those achievable by the best known polynomial-time sequential algorithms, a result that substantially improves upon the state of the art. Our algorithms are rather simple and adapt to the intrinsic complexity of the dataset, captured by the doubling dimension $D$ of the metric space. Specifically, our analysis shows that the algorithms become very space-efficient for the important case of small (constant) $D$. These theoretical results are complemented with a set of experiments on real-world and synthetic datasets of up to over a billion points, which show that our algorithms yield better quality solutions than the state of the art while featuring excellent scalability, and that they also lend themselves to sequential implementations much faster than existing ones.
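To make the $k$-center objective concrete, the following is a minimal Python sketch of the classical sequential farthest-first traversal (the greedy 2-approximation commonly attributed to Gonzalez). It is not the MapReduce or Streaming algorithm of this paper, and it does not handle outliers. The function names `farthest_first_centers` and `clustering_radius`, and the choice of Euclidean distance, are assumptions made only for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def farthest_first_centers(points: Sequence[Point], k: int) -> List[Point]:
    """Pick k centers greedily: each new center is the point farthest
    from the centers chosen so far (Euclidean distance assumed)."""
    if not 0 < k <= len(points):
        raise ValueError("k must satisfy 0 < k <= len(points)")

    # Start from an arbitrary point; here, the first one.
    centers = [points[0]]
    # dist[i] = distance of points[i] from its closest chosen center.
    dist = [math.dist(p, centers[0]) for p in points]

    while len(centers) < k:
        # The next center is the point currently farthest from all centers.
        idx = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[idx])
        # Update the closest-center distances with the new center.
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]

    return centers


def clustering_radius(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """k-center objective: maximum distance of any point from its closest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    sample = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
    chosen = farthest_first_centers(sample, k=2)
    print(chosen, clustering_radius(sample, chosen))
```

The sketch takes $O(k|S|)$ distance computations and keeps all of $S$ in memory, which is exactly the cost that distributed and streaming approaches aim to avoid on very large inputs.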
