Model-based co-clustering for mixed type data

Abstract: Clustering is a well-established tool for creating groups of observations. The emergence of high-dimensional data sets with a very large number of features has motivated co-clustering techniques, and several methods have been developed to produce groups of observations and groups of features simultaneously. By partitioning the data set into blocks (each block being the crossing of a row-cluster and a column-cluster), these techniques can sometimes summarize the data set and its inherent structure better than clustering the observations alone. The Latent Block Model (LBM) is a well-known method for performing co-clustering. However, data sets whose features are of different types (here called mixed type data sets) have recently become more common, and the LBM is not directly applicable to them. Here a natural extension of the usual LBM, the “Multiple Latent Block Model” (MLBM), is proposed in order to handle mixed type data sets. Inference is performed using a Stochastic EM algorithm that embeds a Gibbs sampler and that accommodates missing data. A model selection criterion is defined to choose the number of row clusters and column clusters. The method is then applied to both simulated and real data sets.
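To make the notion of a block concrete, the following is a minimal sketch, not the MLBM inference procedure itself. It assumes the row and column partitions are already known (in the paper they are estimated), uses a single continuous feature type, and simply skips missing entries. The function name `block_means` and the toy data are hypothetical, chosen for illustration only.

```python
from collections import defaultdict


def block_means(data, row_labels, col_labels):
    """Average the observed entries inside each (row-cluster, column-cluster) block.

    Assumption: the partitions are given; this does not estimate them.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is None:  # missing entry: ignored in the block summary
                continue
            key = (row_labels[i], col_labels[j])
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy 3 x 3 data set with one missing value (hypothetical numbers).
data = [
    [1.0, 1.2, 5.0],
    [0.8, None, 5.2],
    [3.0, 3.1, 9.0],
]
row_labels = [0, 0, 1]  # rows 0 and 1 share a row-cluster
col_labels = [0, 0, 1]  # columns 0 and 1 share a column-cluster

summary = block_means(data, row_labels, col_labels)
for block, mean in sorted(summary.items()):
    print(block, round(mean, 3))
```

Each key of `summary` is one block, so a 3 x 3 table is reduced here to four block-level values. A model-based approach such as the LBM replaces these plain averages with per-block distribution parameters and estimates the two partitions jointly.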
