Massively Parallel Algorithms and Hardness for Single-Linkage Clustering Under $\ell_p$-Distances

We present massively parallel computation (MPC) algorithms and hardness-of-approximation results for computing Single-Linkage Clustering of $n$ input $d$-dimensional vectors under Hamming, $\ell_1$, $\ell_2$, and $\ell_\infty$ distances. All our algorithms run in $O(\log n)$ rounds of MPC for any fixed $d$ and achieve a $(1+\epsilon)$-approximation for every distance except Hamming, for which we give an exact algorithm. We also show constant-factor inapproximability results for $o(\log n)$-round algorithms under standard MPC hardness assumptions, for sufficiently large dimension depending on the distance used. Experiments on a variety of datasets demonstrate the efficiency of our Apache Spark implementation, with speedups of several orders of magnitude.
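
For orientation, the problem itself can be illustrated with a small sequential baseline. The sketch below is not the MPC algorithm summarized above; it relies only on the standard equivalence between single-linkage clustering and minimum spanning trees, and runs Kruskal's algorithm over all pairwise distances in $O(n^2 \log n)$ time. The function names `distance` and `single_linkage` and the metric labels are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def distance(u, v, metric="l2"):
    """Distance between two equal-length vectors under the chosen metric."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if metric == "hamming":
        return sum(1 for x in diffs if x != 0)  # number of differing coordinates
    if metric == "l1":
        return sum(diffs)
    if metric == "l2":
        return sum(x * x for x in diffs) ** 0.5
    if metric == "linf":
        return max(diffs)
    raise ValueError(f"unknown metric: {metric}")


def single_linkage(points, metric="l2"):
    """Return the single-linkage merge sequence as a list of (dist, i, j).

    Sequential baseline (hypothetical helper, not the paper's method):
    Kruskal's MST algorithm on the complete distance graph. Each accepted
    edge corresponds to one merge of the single-linkage dendrogram.
    """
    n = len(points)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        (distance(points[i], points[j], metric), i, j)
        for i, j in combinations(range(n), 2)
    )
    merges = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different clusters: record a merge
            parent[ri] = rj
            merges.append((w, i, j))
    return merges
```

Cutting the returned merge sequence at a distance threshold, or stopping after $n - k$ merges, yields a flat clustering into $k$ clusters; the quadratic number of candidate edges is the cost that a parallel algorithm has to avoid.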
