Generalized K-Harmonic Means - Dynamic Weighting of Data in Unsupervised Learning

We propose a new class of center-based iterative clustering algorithms, K-Harmonic Means (KHMp), which is essentially insensitive to the initialization of the centers, as demonstrated through extensive experiments. The insensitivity to initialization is attributed to a dynamic weighting function, which increases the importance of data points that are far from every center in the next iteration. The dependency of K-Means' and EM's performance on the initialization of the centers has been a major problem, and many attempts have been made to generate good initializations to work around it. KHMp addresses the intrinsic problem by replacing the minimum distance from a data point to the centers, used in K-Means, with the harmonic average of the distances from the data point to all centers. KHMp significantly improves the quality of clustering results compared with both K-Means and EM. The KHMp algorithms have been implemented in both sequential and parallel languages and tested on hundreds of randomly generated datasets with different data distributions and clustering characteristics.
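The idea in the abstract can be sketched in code: the performance function sums, over data points, the harmonic average of the p-th powers of point-to-center distances, and the resulting center update implicitly weights each point by how far it is from all centers. The following is a minimal NumPy sketch, not the authors' implementation; the function names, the epsilon guard against zero distances, and the specific vectorization are illustrative assumptions.

```python
import numpy as np

def khm_objective(X, C, p=2):
    """KHM_p performance function: sum over points of the harmonic
    average of the p-th powers of distances to the K centers."""
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # guard: a point sitting on a center
    return np.sum(C.shape[0] / np.sum(d ** (-p), axis=1))

def khm_step(X, C, p=2):
    """One KHM_p center update. The factor 1/(sum_l d_il^-p)^2 acts as
    the dynamic weight: points far from every center contribute more."""
    d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)
    q = d ** (-p - 2)                          # n x K affinity to each center
    w = q / (np.sum(d ** (-p), axis=1) ** 2)[:, None]  # combined weight
    return (w.T @ X) / np.sum(w, axis=0)[:, None]

if __name__ == "__main__":
    # Two well-separated blobs; centers start away from both blob means.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
                  [10., 10.], [10., 11.], [11., 10.], [11., 11.]])
    C = np.array([[3., 3.], [7., 7.]])
    before = khm_objective(X, C)
    for _ in range(100):
        C = khm_step(X, C)
    after = khm_objective(X, C)
    print(before, after)
    print(C[np.argsort(C[:, 0])])
```

Unlike K-Means, no point is ever "owned" by a single center here: every center update blends all points, with the soft membership `q` and the per-point weight combining into `w`.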
