SOM Clustering Using Spark-MapReduce

In this paper, we consider the design of clustering algorithms that can run under the MapReduce model on the Spark platform, one of the most popular programming environments for processing large datasets. We focus on the practical and popular serial Self-Organizing Map (SOM) clustering algorithm. SOM is a well-known unsupervised learning algorithm that is useful for cluster analysis of large quantities of data. We have designed two scalable implementations of the SOM-MapReduce algorithm. We report experiments and demonstrate performance in terms of classification accuracy, Rand index, and speedup on real and synthetic data with 100 million points, using varying numbers of cores.
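
To make the idea of expressing SOM as a MapReduce job concrete, the following is a minimal sketch of one iteration of the standard batch SOM update written as a map step and a reduce step. It is an illustration under stated assumptions, not the two implementations designed in this paper: the batch formulation, the Gaussian neighborhood kernel, and all function names (`bmu`, `neighborhood`, `map_point`, `reduce_pairs`, `batch_som_iteration`) are hypothetical choices, and plain Python lists stand in for a distributed Spark dataset.

```python
import math
from collections import defaultdict

def bmu(x, weights):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(range(len(weights)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, weights[k])))

def neighborhood(k, c, grid, sigma):
    """Assumed Gaussian kernel on the squared grid distance between neurons k and c."""
    d2 = sum((a - b) ** 2 for a, b in zip(grid[k], grid[c]))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def map_point(x, weights, grid, sigma):
    """Map step: for one data point, emit (neuron, (h * x, h)) for every neuron."""
    c = bmu(x, weights)
    for k in range(len(weights)):
        h = neighborhood(k, c, grid, sigma)
        yield k, ([h * v for v in x], h)

def reduce_pairs(pairs, dim):
    """Reduce step: per neuron, sum the weighted points and the kernel weights."""
    num = defaultdict(lambda: [0.0] * dim)
    den = defaultdict(float)
    for k, (hx, h) in pairs:
        num[k] = [a + b for a, b in zip(num[k], hx)]
        den[k] += h
    return num, den

def batch_som_iteration(points, weights, grid, sigma):
    """One batch update: each new prototype is a kernel-weighted mean of the data."""
    dim = len(weights[0])
    pairs = (kv for x in points for kv in map_point(x, weights, grid, sigma))
    num, den = reduce_pairs(pairs, dim)
    return [
        [v / den[k] for v in num[k]] if den[k] > 0 else list(weights[k])
        for k in range(len(weights))
    ]

# Toy usage: a 1x2 map over four one-dimensional points.
grid = [(0, 0), (0, 1)]
weights = [[0.0], [10.0]]
points = [[1.0], [2.0], [8.0], [9.0]]
new_weights = batch_som_iteration(points, weights, grid, sigma=0.5)
```

Because the per-neuron sums are associative and commutative, the reduce step could in principle be carried out by a keyed aggregation on a cluster, with the current prototypes shared read-only across workers; how the paper's two implementations actually partition this work is not specified in the abstract.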
