A new information theoretic analysis of sum-of-squared-error kernel clustering

The contribution of this paper is to provide a new input space analysis of the properties of sum-of-squared-error K-means clustering performed in a Mercer kernel feature space. Such an analysis has been missing until now, even though kernel K-means has been popular in the clustering literature. Our derivation extends the theory of traditional K-means from properties of mean vectors to information theoretic properties of Parzen window estimated probability density functions (pdfs). In particular, Euclidean distance-based kernel K-means is shown to maximize an integrated squared error divergence measure between cluster pdfs and the overall pdf of the data, while a cosine similarity-based approach maximizes a Cauchy-Schwarz divergence measure. Furthermore, the iterative rules which assign data points to clusters in order to maximize these criteria are shown to depend on the cluster pdfs evaluated at the data points, in addition to the Rényi entropies of the clusters. Bayes' rule is shown to be a special case.
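To make the connection concrete, the sketch below illustrates the two ingredients the analysis relates: Euclidean distance-based kernel K-means computed purely from a Gram matrix, and a Cauchy-Schwarz divergence between a cluster's Parzen window pdf and the overall data pdf, estimated from mean kernel values. This is a minimal illustration under a Gaussian kernel assumption; function names and the simplified divergence estimator are illustrative and not taken from the paper.

```python
# Sketch: kernel K-means from a Gram matrix, plus a Parzen-window
# Cauchy-Schwarz divergence estimate between one cluster and all data.
# Names and simplifications are illustrative assumptions.
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Assign each point to the cluster whose feature-space mean is
    closest, using only kernel evaluations (no explicit feature map)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)
    for _ in range(n_iter):
        dist = np.empty((n, n_clusters))
        for c in range(n_clusters):
            idx = np.where(labels == c)[0]
            if idx.size == 0:
                dist[:, c] = np.inf
                continue
            # ||phi(x_i) - m_c||^2 = K_ii - 2*mean_j K_ij + mean_{j,l} K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def cs_divergence(K, cluster_idx):
    """Cauchy-Schwarz divergence between the Parzen pdf of one cluster
    and the Parzen pdf of all data, D_CS = -log(<p_c,p>/sqrt(<p_c,p_c><p,p>)),
    with each inner product estimated by a mean of kernel values."""
    cross = K[cluster_idx, :].mean()                        # <p_c, p>
    within = K[np.ix_(cluster_idx, cluster_idx)].mean()     # <p_c, p_c>
    overall = K.mean()                                      # <p, p>
    return -np.log(cross / np.sqrt(within * overall))
```

As a usage sketch, one might compute K = gaussian_kernel_matrix(X), run kernel_kmeans(K, 2), and then evaluate cs_divergence for each resulting cluster; the abstract's claim is that the cosine similarity-based assignment rule implicitly drives such a divergence measure upward, while the Euclidean distance-based rule does the analogous thing for an integrated squared error divergence.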
