Diagonal latent block model for binary data

This paper addresses the problem of co-clustering binary data in the latent block model framework, with diagonal constraints on the resulting data partitions. We consider the Bernoulli generative mixture model and present three new models that differ in the assumptions made about the degree of homogeneity of the diagonal blocks. The proposed models are parsimonious and take the structure of a data matrix into account when reorganizing it into homogeneous diagonal blocks. For each model, we derive an algorithm based on the classification expectation-maximization (CEM) algorithm, which maximizes the complete-data likelihood. We show that our approach can outperform other state-of-the-art (co-)clustering methods on synthetic sparse and non-sparse data. We also demonstrate its effectiveness for document clustering on real-world benchmark data sets.
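To make the idea of a CEM-style alternation under a diagonal constraint concrete, the following is a minimal sketch and not one of the three models of the paper. It assumes a single Bernoulli parameter `p` shared by all diagonal blocks, a single parameter `q` for all off-diagonal cells, equal mixing proportions, and user-supplied initial column labels. The names `diagonal_cem`, `_assign`, `w_init`, `n_iter`, and `eps` are hypothetical and chosen only for illustration.

```python
import numpy as np

def _assign(X, other_onehot, p, q):
    """For every row of X, pick the cluster with the highest log-likelihood."""
    on_ll = X * np.log(p) + (1 - X) * np.log(1 - p)    # cell inside a diagonal block
    off_ll = X * np.log(q) + (1 - X) * np.log(1 - q)   # cell outside the diagonal
    # scores[i, k]: log-likelihood of row i if it joins cluster k
    scores = on_ll @ other_onehot + off_ll @ (1 - other_onehot)
    return scores.argmax(axis=1)

def diagonal_cem(X, g, w_init, n_iter=10, eps=1e-6):
    """Hard-assignment sketch: g row clusters paired one-to-one with g column clusters.

    Assumptions (not taken from the paper): one shared p, one shared q,
    equal mixing proportions, fixed number of iterations.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w_init)
    p, q = 0.9, 0.1                              # illustrative starting values
    for _ in range(n_iter):
        # Classification steps: update row labels, then column labels.
        z = _assign(X, np.eye(g)[w], p, q)
        w = _assign(X.T, np.eye(g)[z], p, q)
        # Maximization step: re-estimate p and q from the current partition.
        same = np.eye(g)[z] @ np.eye(g)[w].T     # 1 where z_i == w_j
        n_on = same.sum()
        n_off = same.size - n_on
        p = np.clip((X * same).sum() / max(n_on, 1), eps, 1 - eps)
        q = np.clip((X * (1 - same)).sum() / max(n_off, 1), eps, 1 - eps)
    return z, w, float(p), float(q)

# Usage: a noise-free 6 x 6 matrix with two 3 x 3 diagonal blocks of ones,
# started from column labels in which one column is misplaced.
X_demo = np.kron(np.eye(2), np.ones((3, 3)))
z_demo, w_demo, p_demo, q_demo = diagonal_cem(X_demo, g=2, w_init=[0, 0, 0, 1, 1, 0])
```

Each pass can only increase the complete-data log-likelihood of this simplified model, because the row step, the column step, and the parameter update each maximize it with the other quantities held fixed. A practical implementation would also need several random restarts and a treatment of empty clusters, which this sketch omits.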
