Model based clustering for mixed data: clustMD

A model-based clustering procedure for mixed-type data, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation-maximisation (EM) algorithm is used to fit the clustMD model; in the presence of nominal data, a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed-type data and data on prostate cancer patients, for whom variables of mixed type have been recorded.
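The generative idea described above can be illustrated with a minimal simulation sketch. It is not the clustMD implementation: the function name `simulate_mixed_data`, the diagonal covariance, the cut-points, the zero threshold for the binary variable, and the multinomial-probit-style rule for the nominal variable are all assumptions chosen for illustration. Each observation draws a cluster label, then a latent Gaussian vector, and the observed variables are obtained by passing the latent dimensions through or thresholding them.

```python
import random
from bisect import bisect_right


def simulate_mixed_data(n, weights, means, sds, cuts, seed=0):
    """Simulate mixed-type rows from a latent Gaussian mixture (illustrative only).

    Latent layout per row (an assumption of this sketch):
      z[0]  -> continuous variable, observed directly
      z[1]  -> binary variable, 1 if z[1] > 0
      z[2]  -> ordinal variable, level = number of cut-points <= z[2]
      z[3:] -> nominal variable, level 0 if all are negative,
               otherwise 1 + index of the largest latent dimension
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Draw the cluster label from the mixing proportions.
        g = rng.choices(range(len(weights)), weights=weights)[0]
        # Draw the latent vector; independent dimensions, i.e. a diagonal covariance.
        z = [rng.gauss(m, s) for m, s in zip(means[g], sds[g])]

        nominal_z = z[3:]
        top = max(nominal_z)
        rows.append({
            "cluster": g,
            "continuous": z[0],
            "binary": int(z[1] > 0),
            "ordinal": bisect_right(cuts, z[2]),
            "nominal": 0 if top < 0 else nominal_z.index(top) + 1,
        })
    return rows


# Two hypothetical clusters, five latent dimensions each.
example_means = [[-2.0, -1.0, -1.0, -1.0, 1.0],
                 [2.0, 1.0, 1.0, 1.0, -1.0]]
example_sds = [[1.0] * 5, [1.0] * 5]
sample = simulate_mixed_data(5, [0.5, 0.5], example_means, example_sds,
                             cuts=[-0.5, 0.5], seed=42)
for row in sample:
    print(row)
```

Clustering reverses this process: given only the observed columns, the cluster labels and latent parameters are estimated, which the abstract states is done with an EM algorithm, or a Monte Carlo EM algorithm when nominal variables are present.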
