PARABLE: A PArallel RAndom-partition Based HierarchicaL ClustEring Algorithm for the MapReduce Framework

Large datasets, on the order of peta- and terabytes, are becoming prevalent in many scientific domains, including astronomy, the physical sciences, bioinformatics and medicine. Parallel and distributed architectures have become popular for storing, querying and analyzing these repositories effectively. Apache Hadoop is one such framework for supporting data-intensive applications. It provides an open-source implementation of the MapReduce programming paradigm, which can be used to build scalable algorithms for pattern analysis and data mining. In this paper, we present a PArallel, RAndom-partition Based hierarchicaL clustEring algorithm (PARABLE) for the MapReduce framework. It proceeds in two main steps: local hierarchical clustering on nodes using mappers and reducers, and integration of the results by a novel dendrogram alignment technique. Empirical results on two large data sets from the KDDCup competition (High Energy Particle Physics and Intrusion Detection), obtained on a large cluster, indicate that the parallel hierarchical clustering algorithm yields significant scalability benefits while maintaining good cluster quality.
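
The first of the two steps can be illustrated with a minimal, single-process sketch. It is not the implementation described in the paper: the function names `random_partition` and `local_dendrogram`, the single-linkage criterion, and the squared Euclidean distance are all assumptions chosen for illustration, and the dendrogram alignment step is omitted because the abstract does not specify it. In a MapReduce setting, the partitioning would roughly correspond to what mappers do and the local clustering to what reducers do.

```python
import random
from itertools import combinations


def random_partition(points, k, seed=0):
    """Assign each point to one of k partitions uniformly at random (hypothetical helper)."""
    rng = random.Random(seed)
    parts = [[] for _ in range(k)]
    for p in points:
        parts[rng.randrange(k)].append(p)
    return parts


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def linkage(c1, c2):
    """Single-linkage distance (an assumption): closest pair across two clusters."""
    return min(sq_dist(p, q) for p in c1 for q in c2)


def local_dendrogram(points):
    """Naive agglomerative clustering of one partition.

    Returns the merge history as (left_id, right_id, new_id) triples.
    Leaves are numbered 0..n-1; each merged cluster gets the next free id.
    """
    clusters = {i: [p] for i, p in enumerate(points)}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest linkage distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((a, b, next_id))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


if __name__ == "__main__":
    data = [(float(i), float(i % 3)) for i in range(12)]
    for part in random_partition(data, k=3):
        print(len(part), local_dendrogram(part))
```

Each partition yields its own dendrogram; combining these into a single hierarchy is the role of the alignment technique, which this sketch leaves out.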
