Comparing Model Selection and Regularization Approaches to Variable Selection in Model-Based Clustering.

We compare two major approaches to variable selection in clustering: model selection and regularization. Based on previous results, we selected the method of Maugis et al. (2009b), which modified the method of Raftery and Dean (2006), as a current state-of-the-art model selection method, and the method of Witten and Tibshirani (2010) as a current state-of-the-art regularization method. We compared the two methods by simulation in terms of their accuracy in both classification and variable selection. In the first simulation experiment, all the variables were conditionally independent given cluster membership. We found that variable selection of either kind yielded substantial gains in classification accuracy when the clusters were well separated, but few gains when the clusters were close together. The two variable selection methods had comparable classification accuracy, but the model selection approach was substantially more accurate in selecting variables. In the second simulation experiment, the variables were correlated given cluster membership. Here the model selection approach was substantially more accurate than the regularization approach in both classification and variable selection, and both methods gave more accurate classifications than K-means without variable selection. However, the model selection approach is not available in very high-dimensional settings.
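The two evaluation criteria named above, classification accuracy and variable selection accuracy, can be illustrated with a minimal sketch. The abstract does not state which measures were used, so the Rand index and the precision/recall pair below are assumptions chosen for illustration, and the function names `rand_index` and `selection_scores` are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two partitions agree.

    Assumed measure of classification accuracy; it is invariant to
    how the cluster labels are numbered.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    # A pair agrees when both partitions put it together or both split it.
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def selection_scores(selected, relevant):
    """Precision and recall of a selected variable set.

    Assumed measure of variable selection accuracy: `relevant` holds the
    indices of the variables that truly carry cluster information.
    """
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 1.0
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    estimate = [1, 1, 0, 0]  # same partition, labels swapped
    print(rand_index(truth, estimate))
    print(selection_scores(selected=[0, 1, 5], relevant=[0, 1, 2, 3]))
```

In a simulation study of this kind, the true cluster labels and the truly informative variables are known by construction, so both scores can be computed for each method and each replicate and then averaged.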

[1]  Anil K. Jain,et al.  Simultaneous feature selection and clustering using mixture models , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  J. Wolfe  Pattern clustering by multivariate mixture analysis , 1970, Multivariate Behavioral Research.

[3]  M. Vannucci,et al.  Bayesian Variable Selection in Clustering High-Dimensional Data , 2005 .

[4]  Gilles Celeux,et al.  Sélection de variables pour la classification par mélanges gaussiens pour prédire la fonction des gènes orphelins , 2009, Monde des Util. Anal. Données.

[6]  Angela Montanari,et al.  Penalized factor mixture analysis for variable selection in clustered data , 2009, Comput. Stat. Data Anal..

[7]  Geoffrey J. McLachlan,et al.  Mixture models : inference and applications to clustering , 1989 .

[8]  Adrian E. Raftery,et al.  Model-Based Clustering, Discriminant Analysis, and Density Estimation , 2002 .

[9]  Robert Tibshirani,et al.  Estimating the number of clusters in a data set via the gap statistic , 2000 .

[11]  E. Levina,et al.  Pairwise Variable Selection for High‐Dimensional Model‐Based Clustering , 2010, Biometrics.

[12]  M. D. Martínez-Miranda,et al.  Computational Statistics and Data Analysis , 2009 .

[13]  Wayne S. DeSarbo,et al.  Model-Based Segmentation Featuring Simultaneous Segment-Level Variable Selection , 2012 .

[14]  Gérard Govaert,et al.  Gaussian parsimonious clustering models , 1995, Pattern Recognit..

[15]  Geoffrey J. McLachlan,et al.  Mixtures of factor analyzers for the analysis of high-dimensional data , 2011 .

[16]  Wei Sun,et al.  Regularized k-means clustering of high-dimensional data and its asymptotic consistency , 2012 .

[17]  Jia Li,et al.  Variable Selection for Clustering by Separability Based on Ridgelines , 2012 .

[18]  Robert Tibshirani,et al.  A Framework for Feature Selection in Clustering , 2010, Journal of the American Statistical Association.

[19]  Doreen Pfeifer,et al.  Statistics and Data Analysis , 1997 .

[20]  Anthony C. Davison,et al.  High-Dimensional Bayesian Clustering with Variable Selection: The R Package bclust , 2012 .

[21]  Ji Zhu,et al.  Variable Selection for Model‐Based High‐Dimensional Clustering and Its Application to Microarray Data , 2008, Biometrics.

[22]  A. Raftery,et al.  Variable Selection for Model-Based Clustering , 2006 .

[23]  Wei Pan,et al.  Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables , 2008, Electronic Journal of Statistics.

[24]  W. Meredith,et al.  Statistics and Data Analysis , 1974 .

[25]  Frédérique Bitton,et al.  CATdb: a public access to Arabidopsis transcriptome data from the URGV-CATMA platform , 2007, Nucleic Acids Res..

[26]  Luca Scrucca,et al.  Dimension reduction for model-based clustering , 2015, Stat. Comput..

[27]  M. Brusco,et al.  Selection of Variables in Cluster Analysis: An Empirical Comparison of Eight Procedures , 2008 .

[28]  J. Friedman,et al.  Clustering objects on subsets of attributes (with discussion) , 2004 .

[29]  Ricardo Fraiman,et al.  Selection of Variables for Cluster Analysis and Classification Rules , 2006, math/0610757.

[30]  Gilles Celeux,et al.  Variable selection in model-based clustering: A general variable role modeling , 2009, Comput. Stat. Data Anal..

[31]  Geoffrey J. McLachlan,et al.  Mixtures of Factor Analyzers , 2000, International Conference on Machine Learning.

[32]  G. Celeux,et al.  Variable Selection for Clustering with Gaussian Mixture Models , 2009, Biometrics.

[33]  S. Merhar,et al.  Letter to the editor , 2005, IEEE Communications Magazine.

[34]  Xiaotong Shen,et al.  Penalized model-based clustering with unconstrained covariance matrices , 2009, Electronic Journal of Statistics.

[35]  A. Raftery,et al.  Model-based Gaussian and non-Gaussian clustering , 1993 .

[36]  Charles Bouveyron,et al.  Model-based clustering of high-dimensional data: A review , 2014, Comput. Stat. Data Anal..

[37]  Camille Brunet,et al.  Discriminative variable selection for clustering with the sparse Fisher-EM algorithm , 2012, Computational Statistics.

[38]  Wei Pan,et al.  Penalized Model-Based Clustering with Application to Variable Selection , 2007, J. Mach. Learn. Res..

[39]  M. Cugmas,et al.  On comparing partitions , 2015 .

[40]  Cordelia Schmid,et al.  High-dimensional data clustering , 2006, Comput. Stat. Data Anal..

[41]  Paul D. McNicholas,et al.  Parsimonious Gaussian mixture models , 2008, Stat. Comput..

[42]  Tengfei Liu,et al.  Model-based clustering of high-dimensional data: Variable selection versus facet determination , 2013, Int. J. Approx. Reason..

[43]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[44]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..