Measuring the component overlapping in the Gaussian mixture model

The ability of a clustering algorithm to deal with overlapping clusters is a major indicator of its efficiency. However, the phenomenon of cluster overlap is still not well characterized mathematically, especially in the multivariate case. In this paper, we study the overlap between Gaussian clusters, since the Gaussian mixture is a fundamental data distribution model suitable for many clustering algorithms. We introduce the novel concept of the ridge curve and establish a theory on the degree of overlap between two components. Based on this theory, we develop an algorithm for calculating the overlap rate. As an example, we use this algorithm to calculate the overlap rates between the classes in the IRIS data set and clear up some of the confusion as to the true number of classes in the data set. We investigate the factors that affect the value of the overlap rate, and show how the theory can be used to generate “truthed data” as well as to measure the overlap rate between a given pair of clusters or components in a mixture. Finally, we give an example of applying the theory to evaluate well-known clustering algorithms.
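
To make the idea of an overlap rate concrete, the following is a minimal one-dimensional sketch. It is not the algorithm developed in the paper, and the definition it uses is an assumption made for illustration only: the score is the mixture density at the valley between the two peaks divided by the density at the lower peak, and it is set to 1 when the mixture has a single peak. The function names `gaussian_pdf` and `overlap_rate_1d` and the grid search are hypothetical choices.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return np.exp(-0.5 * z * z) / (std * np.sqrt(2.0 * np.pi))


def overlap_rate_1d(w1, mean1, std1, w2, mean2, std2, n_grid=20001):
    """Illustrative overlap score for a two-component 1-D Gaussian mixture.

    Assumed definition (not taken from the paper): density at the valley
    between the peaks divided by the density at the lower peak, or 1.0 if
    the mixture density has no valley (a single peak).
    """
    lo, hi = min(mean1, mean2), max(mean1, mean2)
    if lo == hi:
        return 1.0  # identical means: the mixture has one peak

    # In 1-D every critical point of the mixture lies between the two means,
    # so a dense grid over [lo, hi] is enough for this sketch.
    xs = np.linspace(lo, hi, n_grid)
    p = w1 * gaussian_pdf(xs, mean1, std1) + w2 * gaussian_pdf(xs, mean2, std2)

    # Interior grid points that are lower than their left neighbour and not
    # higher than their right neighbour are candidate valleys.
    interior = p[1:-1]
    candidates = np.where((interior < p[:-2]) & (interior <= p[2:]))[0] + 1
    if candidates.size == 0:
        return 1.0  # no valley: the components merge into a single peak

    valley = candidates[np.argmin(p[candidates])]
    left_peak = p[:valley].max()
    right_peak = p[valley + 1:].max()
    return float(p[valley] / min(left_peak, right_peak))


if __name__ == "__main__":
    for gap in (1.0, 3.0, 10.0):
        rate = overlap_rate_1d(0.5, 0.0, 1.0, 0.5, gap, 1.0)
        print(f"distance between means = {gap}: overlap score = {rate:.6f}")
```

Under this assumed definition, the score approaches 0 as the components move apart and reaches 1 once they can no longer be distinguished as separate peaks. A multivariate version would need to locate the peaks and the saddle point of the density, which a grid over the segment between the means does not do in general.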
