Feature selection in finite mixture of sparse normal linear models in high-dimensional feature space.

Rapid advances in modern technology have allowed scientists to collect data of unprecedented size and complexity. This is particularly the case in genomics applications. One type of statistical problem in such applications is concerned with modeling an output variable as a function of a small subset of a large number of features, based on relatively small sample sizes that may even come from multiple subpopulations. Selecting the correct predictive features (variables) for each subpopulation is therefore key. To address this issue, we consider the problem of feature selection in finite mixture of sparse normal linear (FMSL) models in large feature spaces. We propose a two-stage procedure to overcome the computational difficulties and large false discovery rates caused by the large model space. First, to deal with the curse of dimensionality, a likelihood-based boosting procedure is designed to effectively reduce the number of candidate features; this is the key thrust of our new method. The greatly reduced set of features is then subjected to a sparsity-inducing procedure via penalized likelihood. A novel scheme is also proposed for the difficult problem of finding good starting points for the expectation-maximization (EM) estimation of the mixture parameters. We use an extended Bayesian information criterion (EBIC) to determine the final FMSL model. Simulation results indicate that the procedure succeeds in selecting the significant features without including a large number of insignificant ones. A real data example on gene transcription regulation is also presented.
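To make the two-stage idea concrete, here is a minimal, illustrative sketch in Python, not the authors' implementation: stage one is approximated by ordinary componentwise L2 boosting as a screening device, and stage two is stood in for by a plain EM fit of a two-component normal mixture of regressions scored with the extended BIC. The names `boost_screen`, `em_mixture_regression`, and `ebic`, and all tuning constants, are hypothetical choices for illustration.

```python
import numpy as np
from math import lgamma

def boost_screen(X, y, n_steps=50, nu=0.1):
    """Stage 1 (screening), simplified: componentwise L2 boosting. At each
    step, fit the current residual with the single best feature, shrink the
    update by `nu`, and record that feature; the set of features ever
    selected is the screened candidate set passed on to stage 2."""
    r = y - y.mean()                       # residual after intercept fit
    col_ss = (X ** 2).sum(axis=0) + 1e-12  # per-feature sums of squares
    selected = set()
    for _ in range(n_steps):
        beta = X.T @ r / col_ss            # per-feature OLS coefficients
        gains = beta ** 2 * col_ss         # RSS reduction per feature
        j = int(np.argmax(gains))
        r = r - nu * beta[j] * X[:, j]
        selected.add(j)
    return sorted(selected)

def em_mixture_regression(X, y, K=2, n_iter=200, seed=0):
    """Stage 2, simplified: plain EM for a K-component normal mixture of
    linear regressions on the screened features. Random responsibilities
    stand in for the paper's starting-point scheme."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Z = rng.dirichlet(np.ones(K), size=n)  # soft component assignments
    for _ in range(n_iter):
        pis = Z.mean(axis=0)
        betas, sigmas = [], []
        for k in range(K):                 # M-step: weighted least squares
            w = Z[:, k]
            beta = np.linalg.solve(X.T @ (w[:, None] * X) + 1e-8 * np.eye(p),
                                   X.T @ (w * y))
            resid = y - X @ beta
            sigmas.append(np.sqrt((w * resid ** 2).sum() / w.sum()) + 1e-12)
            betas.append(beta)
        dens = np.column_stack(            # E-step: responsibilities
            [pis[k] / sigmas[k]
             * np.exp(-0.5 * ((y - X @ betas[k]) / sigmas[k]) ** 2)
             for k in range(K)])
        Z = dens / dens.sum(axis=1, keepdims=True)
    loglik = np.log(dens.sum(axis=1) / np.sqrt(2.0 * np.pi)).sum()
    return betas, sigmas, pis, loglik

def ebic(loglik, df, n, p_full, gamma=1.0):
    """Extended BIC of Chen and Chen (2008): ordinary BIC plus a
    2*gamma*log C(p_full, df) penalty for the size of the model space.
    Here `df` is a simplified count of selected regression coefficients."""
    log_choose = lgamma(p_full + 1) - lgamma(df + 1) - lgamma(p_full - df + 1)
    return -2.0 * loglik + df * np.log(n) + 2.0 * gamma * log_choose

# Illustrative pipeline on data (X, y):
#   cand = boost_screen(X, y)
#   *_, loglik = em_mixture_regression(X[:, cand], y)
#   score = ebic(loglik, df=2 * len(cand), n=len(y), p_full=X.shape[1])
```

In the paper itself, stage two applies a sparsity-inducing penalty within the mixture likelihood rather than the unpenalized EM shown here, and the EBIC is then used to choose among the resulting candidate sparse models; the sketch only traces the overall screen-then-select flow.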
