I-nice: A new approach for identifying the number of clusters and initial cluster centres

Abstract This paper proposes I-nice, a new method for automatically identifying the number of clusters in a dataset and selecting the initial cluster centres. The method mimics a human observer picking out mountain peaks in a field survey: the clusters in a dataset are treated as the hills in a terrain. The distribution of distances between an observation point and the objects is computed and modelled by a set of Gamma mixture models (GMMs), which are fitted with the expectation-maximization (EM) algorithm. The best-fitted model is selected with a variant of the Akaike information criterion (AICc). In the I-niceSO algorithm, which uses a single observation point, the number of components in the selected model is taken as the number of clusters, and the objects in each component are analysed with the k-nearest-neighbour method to find the initial cluster centres. For complex data with many clusters, we propose the I-niceMO algorithm, which combines the results of multiple observation points. Experimental results show that the two algorithms significantly outperformed two state-of-the-art methods (Elbow and Silhouette) in identifying the correct number of clusters. The results also show that I-niceMO improved the accuracy and efficiency of the k-means clustering process.
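The model-selection step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the EM fitting itself is omitted (the mixture parameters are supplied by hand), and the parameter count of 3M - 1 for an M-component Gamma mixture (one shape and one scale per component, plus M - 1 free mixing weights) is an assumption. The AICc formula used is the standard small-sample correction, AIC + 2k(k+1)/(n - k - 1).

```python
import math


def distances_from(observation_point, objects):
    """Euclidean distance from one observation point to every object."""
    return [math.dist(observation_point, x) for x in objects]


def gamma_mixture_loglik(d, weights, shapes, scales):
    """Log-likelihood of positive distances d under a Gamma mixture.

    The parameters are taken as given; in the method described above they
    would come from EM, which this sketch does not implement.
    """
    total = 0.0
    for x in d:
        density = 0.0
        for w, a, s in zip(weights, shapes, scales):
            log_pdf = (a - 1) * math.log(x) - x / s - math.lgamma(a) - a * math.log(s)
            density += w * math.exp(log_pdf)
        total += math.log(density)
    return total


def n_free_params(n_components):
    # Assumption: shape + scale per component, plus M - 1 free mixing weights.
    return 3 * n_components - 1


def aicc(loglik, n_params, n_samples):
    """AIC plus the small-sample correction term (requires n > k + 1)."""
    aic = -2.0 * loglik + 2.0 * n_params
    return aic + (2.0 * n_params * (n_params + 1)) / (n_samples - n_params - 1)


if __name__ == "__main__":
    # Toy data: two groups of points at different distances from the origin.
    objects = [(1.0, 0.0), (0.0, 1.2), (0.9, 0.3), (1.1, 0.2),
               (5.0, 0.0), (0.0, 5.3), (4.8, 1.0), (3.0, 4.2),
               (1.0, 0.4), (0.2, 1.0), (4.9, 0.5), (0.5, 5.0)]
    d = distances_from((0.0, 0.0), objects)

    # Hand-picked (not fitted) candidate models with one and two components.
    candidates = {
        1: ([1.0], [2.0], [1.5]),
        2: ([0.5, 0.5], [50.0, 100.0], [0.021, 0.051]),
    }
    for m, (w, a, s) in candidates.items():
        score = aicc(gamma_mixture_loglik(d, w, a, s), n_free_params(m), len(d))
        print(m, round(score, 2))
```

The candidate with the lowest AICc would be selected, and its number of components taken as the estimated number of clusters.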
