A PAC-Bayesian Approach to Formulation of Clustering Objectives

Clustering is a widely used tool for exploratory data analysis. However, the theoretical understanding of clustering is very limited. We still do not have a well-founded answer to the seemingly simple question of “how many clusters are present in the data?”, and a formal comparison of clusterings based on different optimization objectives is far beyond our abilities. The lack of good theoretical support gives rise to multiple heuristics that confuse practitioners and stall development of the field. We suggest that the ill-posed nature of clustering problems is caused by the fact that clustering is often taken out of its subsequent application context. We argue that one does not cluster the data just for the sake of clustering it, but rather to facilitate the solution of some higher-level task. By evaluating a clustering’s contribution to the solution of the higher-level task it is possible to compare different clusterings, even those obtained with different optimization objectives. In preceding work it was shown that such an approach can be applied to the evaluation and design of co-clustering solutions. Here we suggest that this approach can be extended to other settings where clustering is applied.
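To make the abstract’s central idea concrete, here is a minimal illustrative sketch (not the paper’s PAC-Bayesian method, and all names and data below are hypothetical): two candidate clusterings of the same data are compared not by any internal clustering criterion, but by how well cluster membership alone supports a downstream prediction task on held-out data.

```python
# Illustrative sketch: compare two clusterings by the held-out accuracy of a
# downstream predictor that sees only the cluster index of each point.
# This is a toy stand-in for "evaluate a clustering by its contribution to a
# higher-level task", not the paper's actual PAC-Bayesian analysis.
import random

random.seed(0)

# Synthetic 1-D data: two well-separated modes; the downstream label is the mode.
train = [(random.gauss(0, 1), 0) for _ in range(50)] + \
        [(random.gauss(10, 1), 1) for _ in range(50)]
test = [(random.gauss(0, 1), 0) for _ in range(50)] + \
       [(random.gauss(10, 1), 1) for _ in range(50)]

def assign(x, centers):
    """Index of the nearest cluster center."""
    return min(range(len(centers)), key=lambda k: abs(x - centers[k]))

def downstream_accuracy(centers):
    """Fit a majority-vote label per cluster on train; score on test."""
    votes = {k: [0, 0] for k in range(len(centers))}
    for x, y in train:
        votes[assign(x, centers)][y] += 1
    label = {k: int(v[1] > v[0]) for k, v in votes.items()}
    return sum(label[assign(x, centers)] == y for x, y in test) / len(test)

# Two candidate clusterings, e.g. produced by different objectives.
aligned = [0.0, 10.0]     # cluster boundary matches the task structure
misaligned = [0.0, 2.0]   # boundary cuts through one mode

print("aligned:", downstream_accuracy(aligned))
print("misaligned:", downstream_accuracy(misaligned))
```

The point of the sketch is that the two clusterings are compared on a common scale (downstream test accuracy) even though they could have come from entirely different optimization objectives.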
