Kernel Clustering: Density Biases and Solutions

Kernel methods are popular in clustering due to their generality and discriminating power. However, we show that many kernel clustering criteria have density biases that theoretically explain practically significant artifacts observed empirically in the past. For example, we provide conditions and formally prove the density-mode isolation bias in kernel K-means for a common class of kernels. We call it Breiman's bias due to its similarity to the histogram-mode isolation previously discovered by Breiman in decision-tree learning with the Gini impurity criterion. We also extend our analysis to other popular kernel clustering methods, e.g., average/normalized cut or dominant sets, where density biases can take different forms. For example, the splitting off of isolated points by cut-based criteria is essentially a sparsest-subset bias, the opposite of the density-mode bias. Our findings suggest that a principled solution for density biases in kernel clustering should directly address data inhomogeneity. We show that density equalization can be implicitly achieved using either locally adaptive weights or locally adaptive kernels. Moreover, density equalization makes many popular kernel clustering objectives equivalent. Our synthetic- and real-data experiments illustrate the density biases and the proposed solutions. We anticipate that a theoretical understanding of kernel clustering limitations and their principled solutions will be important for a broad spectrum of data analysis applications across the disciplines.
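The two ingredients discussed above, kernel K-means with a fixed-bandwidth kernel (subject to the density-mode bias) and a locally adaptive kernel as one density-equalization fix, can be illustrated with a minimal NumPy sketch. This is an illustrative implementation under our own assumptions, not the paper's exact formulation: the function names are ours, and the KNN-based bandwidth rule (each point's bandwidth set to the distance to its `knn`-th nearest neighbor) is one simple choice of locally adaptive kernel.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Fixed-bandwidth Gaussian kernel: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def adaptive_gaussian_kernel(X, knn=5):
    """Locally adaptive kernel: each point gets its own bandwidth, the distance
    to its knn-th nearest neighbor, so sparse regions use wider kernels."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sig = np.sqrt(np.sort(sq, axis=1)[:, knn])  # sorted index 0 is the point itself
    return np.exp(-sq / (sig[:, None] * sig[None, :]))

def kernel_kmeans(K, k, init=None, n_iter=100, seed=0):
    """Lloyd-style kernel K-means on a precomputed kernel matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, n) if init is None else np.asarray(init).copy()
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue  # empty cluster: leave its column at +inf
            # ||phi(x_i) - mu_c||^2 up to the constant K_ii term:
            dist[:, c] = (-2.0 * K[:, mask].sum(1) / m
                          + K[np.ix_(mask, mask)].sum() / m ** 2)
        new = dist.argmin(1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```

On inhomogeneous data, `gaussian_kernel` with a small fixed `sigma` tends to carve out a tight cluster around the densest mode (Breiman's bias), whereas swapping in `adaptive_gaussian_kernel` implicitly equalizes density and mitigates the effect.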
