PAC-Bayesian Analysis of Co-clustering and Beyond

We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish between two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance on the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization, and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high-dimensional tensors. According to the bounds, the generalization abilities of tree-shaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights, we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering extends readily to this problem and suggests that graph clustering should optimize the trade-off between empirical data fit and the mutual information that clusters preserve on graph nodes.
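
To make the trade-off described above concrete, the following is a minimal sketch, not the bound or the algorithm derived in the paper. It assumes soft cluster assignments stored as matrices q(c|x), a uniform distribution over row (or column) IDs when computing the mutual information, and a hypothetical objective that adds a complexity term to the empirical loss. The names `mutual_information` and `tradeoff`, the weight `beta`, and the scaling by the numbers of rows and columns are illustrative assumptions.

```python
import numpy as np


def mutual_information(q):
    """I(X;C) in nats, where q[x, c] = q(c|x) and X is assumed uniform over rows."""
    q = np.asarray(q, dtype=float)
    marginal = q.mean(axis=0)  # q(c) under the assumed uniform p(x)
    # Treat 0 * log(0) as 0: leave the ratio at 1 wherever q(c|x) is zero.
    mask = q > 0
    ratio = np.ones_like(q)
    ratio[mask] = q[mask] / np.broadcast_to(marginal, q.shape)[mask]
    return float(np.sum(q * np.log(ratio)) / q.shape[0])


def tradeoff(empirical_loss, q_rows, q_cols, beta, n_samples):
    """Hypothetical objective: empirical loss plus weighted information terms.

    The weighting below is an illustrative assumption, not the paper's bound.
    """
    q_rows = np.asarray(q_rows, dtype=float)
    q_cols = np.asarray(q_cols, dtype=float)
    complexity = (q_rows.shape[0] * mutual_information(q_rows)
                  + q_cols.shape[0] * mutual_information(q_cols))
    return empirical_loss + beta * complexity / n_samples


# Four rows split evenly and deterministically between two clusters.
hard = np.eye(2)[[0, 0, 1, 1]]
# Four rows assigned to both clusters with equal probability.
soft = np.full((4, 2), 0.5)

print(mutual_information(hard))  # equals ln 2 for this assignment
print(mutual_information(soft))  # equals 0: the clusters say nothing about the row ID
print(tradeoff(0.5, hard, hard, beta=1.0, n_samples=100))
```

In this sketch a deterministic assignment preserves the most information about row IDs and so pays the largest complexity term, while a uniform assignment pays none but typically fits the data poorly. Minimizing the sum over assignments is what balances the two.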
