Agglomerative Fuzzy K-Means Clustering Algorithm

Introduction

Clustering is the process of grouping a set of objects into clusters so that objects in the same cluster are highly similar to one another but very dissimilar to objects in other clusters. The K-Means algorithm is well known for its efficiency in clustering large data sets. Fuzzy versions of the K-Means algorithm have been reported by Ruspini and Bezdek, in which each pattern is allowed to have memberships in all clusters rather than a distinct membership in one single cluster. Numerous problems in real-world applications, such as pattern recognition and computer vision, can be tackled effectively by fuzzy K-Means algorithms.

There are two major issues in the application of K-Means-type (nonfuzzy or fuzzy) algorithms in cluster analysis. The first issue is that the number of clusters k must be determined in advance as an input to these algorithms. In a real data set, k is usually unknown. In practice, different values of k are tried, and cluster validation techniques are used to evaluate the clustering results and determine the best value of k.

The second issue is that K-Means-type algorithms use alternating minimization methods to solve nonconvex optimization problems in finding cluster solutions. These algorithms require a set of initial cluster centres to start and often end up with different clustering results from different sets of initial cluster centres. K-Means-type algorithms are therefore very sensitive to the initial cluster centres. Usually, these algorithms are run with different initial guesses of the cluster centres, and the results are compared in order to determine the best clustering result. One way is to select the clustering result with the smallest value of the objective function formulated in the K-Means-type algorithm. In addition, cluster validation techniques can be employed to select the best clustering result.
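
The restart strategy described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not code from the paper: the function names `kmeans` and `best_of_restarts`, the random choice of data points as initial centres, and the squared Euclidean objective are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, rng, n_iter=100):
    """Hard K-Means from one random initialisation.

    Returns (centres, labels, objective), where the objective is the sum of
    squared Euclidean distances from each point to its assigned centre.
    """
    # Assumed initialisation: k distinct data points chosen at random.
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centre.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centre to the mean of its points.
        new_centres = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centre
                new_centres[j] = members.mean(axis=0)
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Recompute assignments and objective for the final centres.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    objective = d2[np.arange(len(X)), labels].sum()
    return centres, labels, objective

def best_of_restarts(X, k, n_restarts=10, seed=0):
    """Run K-Means from several random starts and keep the run with the
    smallest objective value. Also returns every run's objective so the
    spread caused by initialisation can be inspected."""
    rng = np.random.default_rng(seed)
    runs = [kmeans(X, k, rng) for _ in range(n_restarts)]
    best = min(runs, key=lambda run: run[2])
    return best, [run[2] for run in runs]
```

Comparing the list of objective values across restarts gives a rough picture of how strongly a given data set depends on the initial centres; selecting by the smallest objective only works for a fixed k, since the objective generally decreases as k grows.
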
Other approaches have been proposed to address this issue through better selection of the initial seeds for the K-Means algorithm, for example by using genetic algorithms. Recently, Arthur and Vassilvitskii proposed a careful seeding of the initial cluster centres to improve clustering results.

In this paper, we propose an agglomerative fuzzy K-Means clustering algorithm for numerical data to tackle the above two issues in the application of K-Means-type clustering algorithms. The new algorithm extends the standard fuzzy K-Means algorithm by introducing a penalty term into the objective function, which makes the clustering process insensitive to the initial cluster centres. The new algorithm can produce more consistent clustering results from different sets of initial cluster centres. Combined with cluster validation techniques, the new algorithm can determine the number of clusters in a data set. Experimental results have demonstrated the effectiveness of the new algorithm in producing consistent clustering results and determining the correct number of clusters in different data sets, some with overlapping inherent clusters.
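
To make the idea of a penalised fuzzy objective concrete, the sketch below shows one possible form. This excerpt does not state the penalty term, so the example assumes an entropy penalty in the spirit of the maximum entropy approach to fuzzy c-means (reference [10]): minimise sum_ij u_ij * ||x_j - z_i||^2 + lam * sum_ij u_ij * log(u_ij), with the memberships of each point summing to 1. The function names, the parameter `lam`, and the stopping rule are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def memberships(X, Z, lam):
    """Membership update for the assumed entropy-penalised objective:
    u_ij is proportional to exp(-||x_j - z_i||^2 / lam), normalised per point."""
    D = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
    # Subtract each row's minimum before exponentiating for numerical stability;
    # this does not change the normalised memberships.
    U = np.exp(-(D - D.min(axis=1, keepdims=True)) / lam)
    return U / U.sum(axis=1, keepdims=True)

def entropy_fuzzy_kmeans(X, centres, lam, n_iter=200, tol=1e-8):
    """Alternating minimisation: update memberships with centres fixed,
    then update centres with memberships fixed."""
    Z = np.asarray(centres, dtype=float).copy()
    for _ in range(n_iter):
        U = memberships(X, Z, lam)
        w = U.sum(axis=0)  # total membership mass per cluster
        # Centre update: membership-weighted mean of the data. A cluster whose
        # mass has underflowed to zero keeps its previous centre.
        Z_new = np.where(
            w[:, None] > 0,
            (U.T @ X) / np.maximum(w, np.finfo(float).tiny)[:, None],
            Z,
        )
        shift = np.abs(Z_new - Z).max()
        Z = Z_new
        if shift < tol:
            break
    return memberships(X, Z, lam), Z
```

Under this assumed form, a larger `lam` produces softer memberships and pulls centres toward one another, while a small `lam` approaches hard K-Means assignments. How the paper chooses the penalty weight and merges centres is not given in this excerpt.
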

[1] Enrique H. Ruspini, et al. A New Approach to Clustering, 1969, Inf. Control.

[2] Michael R. Anderberg, et al. Cluster Analysis for Applications, 1973.

[3] Frank Hoeppner, et al. Fuzzy shell clustering algorithms in image processing: fuzzy C-rectangular and 2-rectangular shells, 1997, IEEE Trans. Fuzzy Syst.

[4] Hichem Frigui, et al. Clustering by competitive agglomeration, 1997, Pattern Recognit.

[5] Herman Chernoff, et al. Cluster Analysis for Applications (Michael R. Anderberg), 1975.

[6] J. MacQueen. Some methods for classification and analysis of multivariate observations, 1967.

[7] Sergei Vassilvitskii, et al. k-means++: the advantages of careful seeding, 2007, SODA '07.

[8] Michael K. Ng, et al. Automated variable weighting in k-means type clustering, 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9] Michael J. Laszlo, et al. A genetic algorithm using hyper-quadtrees for low-dimensional k-means clustering, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10] Sadaaki Miyamoto, et al. Fuzzy c-means as a regularization and maximum entropy approach, 1997.

[11] G. H. Ball, et al. A clustering technique for summarizing multivariate data, 1967, Behavioral Science.

[12] Greg Hamerly, et al. Learning the k in k-means, 2003, NIPS.

[13] Anil K. Jain, et al. Algorithms for Clustering Data, 1988.

[14] Michael J. Laszlo, et al. A genetic algorithm that exchanges neighboring centers for k-means clustering, 2007, Pattern Recognit. Lett.

[15] V. J. Rayward-Smith, et al. Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition, 1999.

[16] Hichem Frigui, et al. Fuzzy and possibilistic shell clustering algorithms and their application to boundary detection and surface approximation. II, 1995, IEEE Trans. Fuzzy Syst.