Design, Analysis and Implementation of Modified K-Mean Algorithm for Large Data-Set to Increase Scalability and Efficiency

Clustering is an unsupervised learning technique. Cluster analysis is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes. Clustering algorithms can be applied in many domains, and clustering is often done as a prelude to some other form of data mining or modeling. The performance of iterative clustering algorithms depends highly on the choice of cluster centers in each step. We propose an efficient, modified K-means clustering algorithm for large data sets whose objective is to find, at each iterative step, cluster centers that are very close to the final solution. The algorithm is based on an optimization formulation of the problem and a novel iterative method. The cluster centers computed using this methodology are found to be very close to the desired cluster centers. Experimental results using the proposed algorithm on a group of randomly constructed data sets are very promising. The best algorithm in each category was identified based on performance.
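For orientation, the iterative scheme that such a method builds on can be sketched as the standard (Lloyd-style) K-means loop: assign each point to its nearest center, then move each center to the mean of its assigned points. This is only a minimal sketch of the baseline algorithm, not the modified algorithm proposed here, whose details the abstract does not give. The names `k_means` and `squared_distance`, the random choice of initial centers, and the stopping rule are illustrative assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    """Baseline K-means (assumed standard form, not the paper's variant).

    points: list of equal-length numeric tuples
    k: number of clusters
    Returns (centers, labels).
    """
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct input points chosen at random.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest current center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Stop when no assignment changes between iterations.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = k_means(data, k=2)
    print(centers, labels)
```

Because every iteration depends on the centers produced by the previous one, a poor choice of centers early on costs extra passes over the data, which is why the choice of centers at each step matters for large data sets.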
