A new approach for finding appropriate number of Clusters using SVD along with determining Best Initial Centroids for K-means Algorithm

Every day, terabytes of data are generated in the real world, and most of them are stored in electronic devices, offering great potential for data analysis. Data is growing not only in volume but also in variety: text, commercial or business records, medical data, images, and multimedia from various sources, the internet being one of them. Most of these data are complex and unstructured. Analysis of unstructured data is difficult and inefficient until the data is transformed into a proper structure. Clustering is a vital procedure in data analysis. It divides data objects into meaningful groups using an unsupervised learning approach. Each group, called a cluster, contains objects that are similar to one another and dissimilar to the objects in other groups. By clustering, we can identify dense and sparse regions and thereby discover distribution patterns and interesting correlations among data attributes. K-means is one of the most popular and simplest algorithms in the clustering literature and is widely used in many applications. Despite its popularity, K-means poses many challenges. This paper addresses two significant ones. The first is selecting the value of K, the number of clusters, which must be supplied by the user. The second is selecting the initial centroids. Methods for both challenges are proposed and implemented.
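To make the two challenges concrete, the following minimal sketch of standard K-means (Lloyd's iteration) shows where the user-supplied inputs enter. It is not the method proposed in this paper: the function name `kmeans`, the `max_iter` parameter, and the toy data are assumptions made for illustration only. K is implied by the number of initial centroids passed in, and both must be chosen before the algorithm starts.

```python
import numpy as np

def kmeans(points, initial_centroids, max_iter=100):
    """Standard K-means; K equals the number of initial centroids supplied."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n_points, K).
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        # Assign each point to its nearest centroid.
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two well-separated groups of three points each.
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# The caller has to decide both K (here 2) and the starting centroids.
labels, centroids = kmeans(data, initial_centroids=data[[0, 3]])
```

Here the result depends entirely on the two choices made in the last line: a different number of starting centroids changes K, and different starting positions can lead the iteration to a different final partition.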
