Improved Parallel Clustering with Optimal Initial Centroids

Clustering large data sets is a challenging task with many applications in areas such as bioinformatics, social networking, and image segmentation. k-means clustering is the most popular and widely used method in commercial applications and scientific research because of its simplicity. However, it has some disadvantages. A major issue is convergence to local optima: the quality of the clustering result depends heavily on the initialization. As a consequence, independent runs can produce different results. The number of clusters must also be specified in advance, but in real applications it is difficult to determine such parameters beforehand. This research aims to develop a parallel clustering algorithm capable of clustering large data sets. An improved k-means-type algorithm is proposed that generates optimal initial centroids using a new heuristic method and works on large data sets using the MapReduce methodology. The proposed method is more accurate than existing methods of a similar nature.
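To make the setting concrete, the following is a minimal sketch of a standard k-means iteration written as a map step and a reduce step, run in a single process. It is not the proposed method: the abstract does not describe the initialization heuristic, so the initial centroids are simply passed in by the caller. The function names (`map_assign`, `reduce_mean`, `kmeans_iteration`, `kmeans`) and the toy data are assumptions made for illustration only.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_mean(points):
    """Reduce step: average all points that share one centroid index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(data, centroids):
    """One k-means pass: map every point, group by key, reduce each group."""
    groups = defaultdict(list)
    for point in data:
        key, value = map_assign(point, centroids)
        groups[key].append(value)
    # A centroid that attracted no points keeps its previous position.
    return [
        reduce_mean(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(data, initial_centroids, max_iter=100):
    """Repeat iterations until the centroids stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        updated = kmeans_iteration(data, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated pairs of points.
    data = [(0, 0), (0, 2), (10, 0), (10, 2)]
    print(kmeans(data, [(0, 0), (10, 0)]))  # one centroid seeded in each pair
    print(kmeans(data, [(0, 0), (0, 2)]))   # both centroids seeded in the same pair
```

On this toy data the two calls converge to different centroids, which illustrates the dependence on initialization noted above. In an actual MapReduce deployment the map calls would run in parallel over data partitions and the grouping by key would be done by the framework's shuffle phase; here both are simulated with an ordinary loop and a dictionary.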
