Clustering based on Generalized Inverse Transformation

This paper presents a novel approach that combines dimension extension and generalized inverse transformation (DEGIT) to perform data clustering. Unlike the k-means algorithm, DEGIT does not require the number of clusters k to be specified in advance: centroid locations are updated, and redundant centroids are eliminated, automatically during the iterative training process. The essence of DEGIT is that clustering is carried out by applying a generalized inverse transformation to the input data, so that each data point is represented as a linear combination of bases in an extended dimension, where each basis corresponds to a centroid and its coefficient measures the closeness between the data point and that basis. The issue of cluster validation is also addressed in this paper. First, principal component analysis is applied to detect whether a dominant dimension exists; if so, the original input data are rotated by a certain angle about a defined center of mass, and the rotated data undergo another run of the iterative training process. After several runs of rotation and iterative training, the labeling results of the runs are compared, and each data point is assigned to the class indexed by the winning centroid, i.e., the centroid to which it was labeled most often.
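
To make the core step concrete, the following is a minimal sketch of one plausible DEGIT-style iteration, not the authors' implementation. It assumes the dimension extension is a constant homogeneous coordinate, takes the Moore-Penrose pseudoinverse as the generalized inverse, and interprets a point's largest coefficient as its closest basis; these choices, and every name and parameter in the code, are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: two well-separated 2-D blobs.
    X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
                   rng.normal(3.0, 0.3, (50, 2))])

    # Assumed dimension extension: append a constant coordinate so that
    # points and bases live in an extended 3-D space.
    X_ext = np.hstack([X, np.ones((len(X), 1))])

    # Over-provision candidate centroids; DEGIT is described as pruning
    # redundant ones during training.
    k0 = 5
    C = X[rng.choice(len(X), size=k0, replace=False)]
    B = np.hstack([C, np.ones((k0, 1))]).T   # bases as columns, shape (3, k0)

    for _ in range(20):
        # Generalized inverse transform (here: Moore-Penrose pseudoinverse):
        # coefficients expressing each extended point as a linear
        # combination of the current bases.
        A = X_ext @ np.linalg.pinv(B).T      # shape (n, current number of bases)

        # Read the largest coefficient as "closest basis", then re-estimate
        # each surviving basis as the mean of the points assigned to it.
        labels = A.argmax(axis=1)
        new_bases = [X_ext[labels == j].mean(axis=0)
                     for j in range(B.shape[1]) if np.any(labels == j)]
        B = np.array(new_bases).T            # unused (redundant) bases drop out

    print("surviving bases (columns, extended coordinates):")
    print(np.round(B, 3))

On the toy data above, the loop typically prunes the five initial candidates down to one basis per blob, which mirrors the paper's claim that k need not be fixed in advance.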

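The rotation-and-voting validation scheme can be sketched in the same hedged spirit. The eigenvalue ratio used to declare a dominant dimension, the 2-D rotation, and the assumption that centroid indices stay aligned across runs are illustrative choices, not details taken from the paper.

    import numpy as np

    def has_dominant_dimension(X, ratio=5.0):
        # PCA via covariance eigenvalues; the threshold `ratio` is an
        # assumed parameter.
        eig = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
        return eig[0] > ratio * max(eig[1], 1e-12)

    def rotate_about_center_of_mass(X, theta):
        # Rotate 2-D data by angle theta about its center of mass.
        com = X.mean(axis=0)
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        return (X - com) @ R.T + com

    def majority_vote(labels_per_run):
        # labels_per_run: (runs, n) array of centroid indices per point;
        # assumes centroid indexing is consistent across runs.
        labels = np.asarray(labels_per_run)
        out = np.empty(labels.shape[1], dtype=int)
        for i in range(labels.shape[1]):
            vals, counts = np.unique(labels[:, i], return_counts=True)
            out[i] = vals[counts.argmax()]
        return out

A driver would test has_dominant_dimension, rerun the iterative training on rotate_about_center_of_mass(X, theta) for several angles, collect one label vector per run, and take majority_vote over the stack to produce the final labeling.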