Ranked k-medoids: A fast and accurate rank-based partitioning algorithm for clustering large datasets

Clustering analysis is the process of dividing a set of objects into non-overlapping subsets. Each subset is a cluster, such that objects within a cluster are similar to one another and dissimilar to the objects in other clusters. Most partitioning-based clustering algorithms tend to become trapped in local optima and are sensitive to initialization and outliers. In this paper, we introduce a novel partitioning algorithm whose initialization does not lead it to a local optimum and which can find all Gaussian-shaped clusters, provided it is given the correct number of clusters. In this algorithm, the similarity between each pair of objects is computed only once, and updating the medoids in each iteration costs O(k × m), where k is the number of clusters and m is the number of objects needed to update the medoids of the clusters. We compare our algorithm with two other partitioning algorithms using four well-known external validation measures over seven standard datasets. The results for the larger datasets show the superiority of the proposed algorithm over the two other algorithms in terms of speed and accuracy.
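The idea of computing pairwise similarities once and then updating each medoid from only m nearby objects can be illustrated with a minimal sketch. This is not the ranked k-medoids procedure itself, whose ranking and update rule the abstract does not specify. It is a generic medoid update under stated assumptions: Euclidean distance stands in for the similarity measure, the candidates for a new medoid are the m objects nearest the current one, and the function names (`pairwise_distances`, `neighbor_order`, `update_medoids`, `cluster`) are hypothetical. Note that this naive update costs on the order of k × m² per iteration, not the O(k × m) stated above.

```python
import math


def pairwise_distances(points):
    """Compute the full distance matrix once (assumption: Euclidean distance)."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def neighbor_order(dist):
    """For each object, list all object indices from nearest to farthest (computed once)."""
    return [sorted(range(len(row)), key=row.__getitem__) for row in dist]


def update_medoids(dist, order, medoids, m):
    """Replace each medoid by the most central object among its m nearest objects."""
    updated = []
    for medoid in medoids:
        group = order[medoid][:m]  # includes the medoid itself (distance 0)
        # Most central = smallest total distance to the rest of the group.
        best = min(group, key=lambda c: sum(dist[c][j] for j in group))
        updated.append(best)
    return updated


def cluster(points, initial_medoids, m, max_iter=100):
    """Iterate the medoid update until it stabilizes, then assign labels."""
    dist = pairwise_distances(points)   # computed once, reused every iteration
    order = neighbor_order(dist)        # computed once, reused every iteration
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        updated = update_medoids(dist, order, medoids, m)
        if updated == medoids:
            break
        medoids = updated
    # Assign every object to its nearest medoid.
    labels = [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(points))
    ]
    return medoids, labels


if __name__ == "__main__":
    # Two small, well-separated groups; the initial medoids are corner points.
    points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
              (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
    medoids, labels = cluster(points, initial_medoids=[0, 5], m=5)
    print(medoids, labels)
```

Because the distance matrix and neighbor ordering are built once, each iteration only reads precomputed values; the choice of m trades update cost against how far a medoid can move in one step.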
