Co-clustering through Optimal Transport

In this paper, we present a novel method for co-clustering, an unsupervised learning approach that discovers homogeneous groups of data instances and features by grouping them simultaneously. The proposed method uses entropy-regularized optimal transport between empirical measures defined on data instances and on features to obtain an estimated joint probability density function, represented by the optimal coupling matrix. This matrix is then factorized to obtain the induced row and column partitions using a multiscale representation approach. To justify the method theoretically, we show how the solution of the regularized optimal transport problem can be interpreted from the variational inference perspective, which motivates its use for co-clustering. The algorithm derived for the proposed method, and its kernelized version based on the Gromov-Wasserstein distance, are fast and accurate, and can automatically determine the number of both row and column clusters. Extensive experimental evaluations demonstrate these properties.
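As a rough illustration of the first step described above, the following sketch computes an entropy-regularized optimal coupling between uniform empirical measures on rows and columns using Sinkhorn-Knopp scaling iterations. The function name `sinkhorn_coupling`, the regularization value, and the random placeholder cost matrix are assumptions made for illustration only; the paper's own cost construction and the subsequent extraction of row and column partitions from the coupling are not shown.

```python
import numpy as np


def sinkhorn_coupling(cost, reg=0.5, n_iter=500, tol=1e-9):
    """Entropy-regularized OT coupling between uniform marginals.

    cost : (n, m) array of transport costs between n rows and m columns.
    reg  : entropic regularization strength (hypothetical default).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform empirical measure on rows
    b = np.full(m, 1.0 / m)   # uniform empirical measure on columns
    K = np.exp(-cost / reg)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iter):
        # Alternate scaling so that row, then column, marginals match.
        u = a / (K @ v)
        v = b / (K.T @ u)
        coupling = u[:, None] * K * v[None, :]
        # Column marginals are exact after the v update; check the rows.
        if np.abs(coupling.sum(axis=1) - a).max() < tol:
            break
    return u[:, None] * K * v[None, :]


# Placeholder cost matrix (assumption): random values in [0, 1).
rng = np.random.default_rng(0)
cost = rng.random((6, 4))
coupling = sinkhorn_coupling(cost)
```

The resulting `coupling` is a nonnegative 6 x 4 matrix whose entries sum to one, so it can be read as an estimated joint distribution over (row, column) pairs, which is the object the abstract proposes to factorize.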
