Penalized factor mixture analysis for variable selection in clustered data

A model-based clustering approach that simultaneously performs dimension reduction and variable selection is presented. Dimension reduction is achieved by assuming that the data have been generated by a linear factor model whose latent variables are modeled as Gaussian mixtures. Variable selection is performed by shrinking the factor loadings through a penalized likelihood method with an L1 penalty. A maximum likelihood estimation procedure based on the EM algorithm is developed, and a modified BIC criterion for selecting the penalization parameter is illustrated. The effectiveness of the proposed model is explored in a Monte Carlo simulation study and in a real example.
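
As a reading aid, the model described above can be sketched in formulas. The notation is assumed here and is not taken from the paper: $y$ is the $p$-dimensional observed vector, $z$ the $q$-dimensional latent factor vector with $q < p$, $\Lambda = (\lambda_{jh})$ the $p \times q$ loading matrix, $K$ the number of mixture components, and $\gamma \ge 0$ the penalization parameter. The exact parameterization and the scaling of the penalty in the paper may differ.

```latex
% Linear factor model with Gaussian-mixture factors (assumed notation)
y = \Lambda z + \varepsilon, \qquad
\varepsilon \sim N_p(0, \Psi), \qquad
z \sim \sum_{k=1}^{K} \pi_k \, N_q(\mu_k, \Sigma_k)

% L1-penalized log-likelihood for a sample y_1, \dots, y_n
\ell_{\mathrm{pen}}(\theta) =
  \sum_{i=1}^{n} \log f(y_i; \theta)
  - \gamma \sum_{j=1}^{p} \sum_{h=1}^{q} \lvert \lambda_{jh} \rvert
```

Under this sketch, a variable $j$ whose loadings $\lambda_{j1}, \dots, \lambda_{jq}$ are all shrunk to zero no longer depends on the latent factors, which is how shrinking the loadings can act as variable selection for the clustering.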
