Unifying data units and models in (co-)clustering

Statisticians are well aware that any task (exploration, prediction) involving a modeling process depends heavily on the measurement units of the data, to the extent that it should be impossible to report a statistical outcome without specifying the pair (unit, model). In this work, this general principle is formalized with a particular focus on model-based clustering and co-clustering for possibly mixed data types (continuous and/or categorical and/or count features), and this opportunity is used to revisit what the related data units are. This formalization allows us to raise three important points: (i) the pair (unit, model) is not identifiable, so that different unit/model interpretations of the same overall modeling process are always possible; (ii) combining different “classical” units with different “classical” models offers an opportunity for a cheap, wide and meaningful expansion of the family of modeling processes defined by the pair (unit, model); (iii) if necessary, this pair, up to the non-identifiability property, can be selected by any traditional model selection criterion. Experiments on real data sets illustrate in detail the practical benefits arising from these three points.
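As a rough illustration of points (i) and (iii), the sketch below uses a single Gaussian component rather than a full mixture, and simulated data rather than the real data sets of the paper; both are simplifying assumptions made here. All names in it (`gaussian_loglik`, `lognormal_loglik`, `bic`) are hypothetical helpers, not functions from any package cited in this work. It compares the pair (identity unit, Gaussian model) with the pair (log unit, Gaussian model). Adding the log-Jacobian of the transformation puts both likelihoods on the original scale, so that BIC can compare them, and it shows that the second pair gives the same likelihood as the pair (identity unit, log-normal model).

```python
import math
import random


def gaussian_loglik(values):
    """Maximized log-likelihood of a single Gaussian fitted to `values`."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # ML variance estimate
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def lognormal_loglik(values):
    """Maximized log-likelihood of a log-normal fitted directly to `values`."""
    logs = [math.log(v) for v in values]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((l - mu) ** 2 for l in logs) / n
    return sum(
        -l - 0.5 * math.log(2 * math.pi * var) - (l - mu) ** 2 / (2 * var)
        for l in logs
    )


def bic(loglik, n_params, n):
    """BIC in the 'smaller is better' convention."""
    return -2 * loglik + n_params * math.log(n)


random.seed(0)
# Simulated positive data (assumption: log-normal, for illustration only).
x = [math.exp(random.gauss(1.0, 0.8)) for _ in range(500)]
n = len(x)

# Pair A: unit = identity, model = Gaussian on x.
ll_a = gaussian_loglik(x)

# Pair B: unit = log, model = Gaussian on log(x). Adding the log-Jacobian
# log|d log(x)/dx| = -log(x) expresses the likelihood on the original scale.
log_x = [math.log(v) for v in x]
ll_b = gaussian_loglik(log_x) - sum(log_x)

# Pair C: unit = identity, model = log-normal on x. Same process as pair B.
ll_c = lognormal_loglik(x)

bic_a = bic(ll_a, 2, n)
bic_b = bic(ll_b, 2, n)
print("BIC, identity unit + Gaussian:", round(bic_a, 1))
print("BIC, log unit + Gaussian:     ", round(bic_b, 1))
print("Pairs B and C coincide:", abs(ll_b - ll_c) < 1e-6)
```

Without the Jacobian term, the two likelihoods would refer to different measurement scales and the BIC values would not be comparable.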
