SparkSNN: A density-based clustering algorithm on Spark

Clustering is one of the most commonly used data mining techniques. Shared nearest neighbor (SNN) clustering is an important density-based clustering technique that has been widely adopted in many application domains, such as environmental science and urban computing. As datasets grow extremely large, processing them on a single machine becomes impractical, so the scalability problem of traditional single-machine clustering algorithms must be addressed. In this paper, we improve the traditional density-based clustering algorithm by using a powerful programming platform (Spark) and distributed computing clusters. In particular, we design and implement a Spark-based shared nearest neighbor clustering algorithm called SparkSNN, a scalable density-based clustering algorithm on Spark for big data analysis. We conduct experiments on real data, namely Maryland crime data, to evaluate the performance of the proposed algorithm with respect to speed-up and scale-up. The experimental results confirm the effectiveness and efficiency of the proposed SparkSNN clustering algorithm.
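
To make the shared nearest neighbor idea concrete, the following is a minimal single-machine sketch in plain Python, not the SparkSNN implementation described in this paper. It assumes the common formulation in which two points are similar only if each appears in the other's k-nearest-neighbor list, with similarity equal to the number of neighbors they share, and in which a point's density is the number of points whose similarity to it reaches a threshold. The function names (`knn_lists`, `snn_similarity`, `snn_density`) and the parameters `k` and `eps` are hypothetical and chosen for illustration only.

```python
from math import dist


def knn_lists(points, k):
    """Return, for each point index, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Brute-force search: sort all other points by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(neighbors, i, j):
    """Number of shared neighbors; zero unless i and j are in each other's k-NN lists."""
    if i in neighbors[j] and j in neighbors[i]:
        return len(neighbors[i] & neighbors[j])
    return 0


def snn_density(neighbors, eps):
    """For each point, count the other points whose SNN similarity to it is at least eps."""
    n = len(neighbors)
    return [
        sum(1 for j in range(n) if j != i and snn_similarity(neighbors, i, j) >= eps)
        for i in range(n)
    ]


if __name__ == "__main__":
    # Two well-separated groups of four points each (illustrative data only).
    points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]
    nb = knn_lists(points, k=3)
    print(snn_similarity(nb, 0, 1))  # 2: same group, two shared neighbors
    print(snn_similarity(nb, 0, 4))  # 0: different groups, not mutual neighbors
    print(snn_density(nb, eps=2))
```

The brute-force neighbor search is quadratic in the number of points, which is the single-machine bottleneck that motivates distributing the computation.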
