A Scalable Hierarchical Clustering Algorithm Using Spark

Clustering is often an essential first step in data mining intended to reduce redundancy, or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting the potential group structures. However, parallelization of such an algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. In this paper, we design a parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for the parallelization of single-linkage clustering algorithm due to its natural expression of iterative process. Our algorithm can be deployed easily in Amazon's cloud environment. And a thorough performance evaluation in Amazon's EC2 verifies that the scalability of our algorithm sustains when the datasets scale up.

[1]  J. Kruskal On the shortest spanning subtree of a graph and the traveling salesman problem , 1956 .

[2]  Robin Sibson,et al.  SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method , 1973, Comput. J..

[3]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[4]  Ashwin Machanavajjhala,et al.  Finding connected components in map-reduce in logarithmic rounds , 2012, 2013 IEEE 29th International Conference on Data Engineering (ICDE).

[5]  Haimonti Dutta,et al.  PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework , 2011 .

[6]  Jimmy J. Lin,et al.  Book Reviews: Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer , 2010, CL.

[7]  Xin-She Yang,et al.  Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.

[8]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[9]  R. Prim Shortest connection networks and some generalizations , 1957 .

[10]  Michael Isard,et al.  DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.

[11]  Wei-keng Liao,et al.  Parallel hierarchical clustering on shared memory platforms , 2012, 2012 19th International Conference on High Performance Computing.

[12]  Jimmy J. Lin,et al.  Large-scale machine learning at twitter , 2012, SIGMOD Conference.

[13]  Scott Shenker,et al.  Fast and Interactive Analytics over Hadoop Data with Spark , 2012, login Usenix Mag..

[14]  Edie M. Rasmussen,et al.  Efficiency of Hierarchic Agglomerative Clustering using the ICL Distributed array Processor , 1989, J. Documentation.

[15]  Bing Zhou,et al.  A parallel hierarchical clustering algorithm for PCs cluster system , 2007, Neurocomputing.

[16]  Ming Ouyang,et al.  Hierarchical Clustering with CUDA/GPU , 2009, PDCCS.

[17]  Peter Norvig,et al.  The Unreasonable Effectiveness of Data , 2009, IEEE Intelligent Systems.

[18]  Ophir Frieder,et al.  Exploiting parallelism to support scalable hierarchical clustering , 2007, J. Assoc. Inf. Sci. Technol..

[19]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[20]  Clark F. Olson,et al.  Parallel Algorithms for Hierarchical Clustering , 1995, Parallel Comput..

[21]  Yuan Yu,et al.  Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.

[22]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[23]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[24]  S. C. Johnson Hierarchical clustering schemes , 1967, Psychometrika.

[25]  M. Patwary,et al.  DiSC : A Distributed Single-Linkage Hierarchical Clustering Algorithm using MapReduce , 2013 .