Variable Selection in Penalized Model‐Based Clustering Via Regularization on Grouped Parameters

Summary: Penalized model‐based clustering has been proposed for high‐dimensional data with small sample sizes, such as those arising in genomic studies; in particular, it can be used for variable selection. A new regularization scheme is proposed that groups together the multiple parameters of the same variable across clusters; it is shown both analytically and numerically to be more effective for variable selection than the conventional L1 penalty. In addition, we develop a strategy to combine this grouping scheme with the grouping of structured variables. Simulation studies and applications to microarray gene expression data for cancer subtype discovery demonstrate the advantage of the new proposal over several existing approaches.
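To make the contrast between the two penalties concrete, here is a minimal sketch, not the authors' implementation. The summary does not state the exact form of the grouped penalty, so the sketch assumes a group-lasso style L2 norm taken over the cluster means of each variable. The function names, the toy mean matrix, and the threshold value are all hypothetical and chosen only for illustration.

```python
import math

def l1_penalty(means, lam):
    """Conventional L1 penalty: lam * sum of |mu_kj| over clusters k and variables j."""
    return lam * sum(abs(m) for row in means for m in row)

def grouped_penalty(means, lam):
    """Grouped penalty (ASSUMED L2 form): lam * sum over variables j of
    the Euclidean norm of (mu_1j, ..., mu_Kj)."""
    n_vars = len(means[0])
    return lam * sum(
        math.sqrt(sum(row[j] ** 2 for row in means)) for j in range(n_vars)
    )

def soft_threshold(means, t):
    """Entrywise shrinkage associated with the L1 penalty."""
    return [[math.copysign(max(abs(m) - t, 0.0), m) for m in row] for row in means]

def group_soft_threshold(means, t):
    """Shrink each variable's vector of cluster means jointly.
    A whole column becomes zero when its L2 norm is at most t."""
    out = [list(row) for row in means]
    for j in range(len(means[0])):
        norm = math.sqrt(sum(row[j] ** 2 for row in means))
        scale = max(1.0 - t / norm, 0.0) if norm > 0 else 0.0
        for k in range(len(means)):
            out[k][j] = scale * means[k][j]
    return out

# Hypothetical toy example: 2 clusters (rows) by 3 variables (columns),
# with data assumed to be standardized so that a zero mean in every
# cluster marks a variable as uninformative for clustering.
toy_means = [[1.5, 0.1, 0.3],
             [-1.5, 0.6, 0.3]]

entrywise = soft_threshold(toy_means, 0.5)
grouped = group_soft_threshold(toy_means, 0.5)
```

In this toy case the entrywise rule zeroes the second variable in one cluster but not the other, whereas the grouped rule treats each variable as a unit: its cluster means are either all shrunk toward zero together or all set exactly to zero. That all-or-nothing behavior per variable is what lets a grouped penalty act directly as a variable selector.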
