Variable selection for general index models via sliced inverse regression

Variable selection, also known as feature selection in machine learning, plays an important role in modeling high-dimensional data and is key to data-driven scientific discovery. We consider the problem of detecting influential variables under the general index model, in which the response depends on the predictors only through an unknown function of one or more linear combinations of them. Instead of building a predictive model of the response given combinations of predictors, we model the conditional distribution of the predictors given the response. This inverse-modeling perspective motivates a stepwise procedure based on likelihood-ratio tests, which is effective and computationally efficient at identifying important variables without specifying a parametric relationship between the predictors and the response. For example, the proposed procedure can detect variables involved in pairwise, three-way, or even higher-order interactions among $p$ predictors in $O(p)$ computational time instead of $O(p^k)$, where $k$ is the highest order of interaction. Its excellent empirical performance in comparison with existing methods is demonstrated through simulation studies as well as real-data examples. Consistency of the variable selection procedure is established when both the number of predictors and the sample size go to infinity.
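To make the inverse-modeling idea concrete, below is a minimal sketch of basic sliced inverse regression (SIR; Li, 1991), which estimates the index directions $\beta_k$ in the general index model $Y = f(\beta_1^\top X, \ldots, \beta_K^\top X, \varepsilon)$ from the eigenvectors of the covariance of within-slice predictor means. This illustrates only the SIR building block, not the authors' stepwise likelihood-ratio procedure; the slice count and the toy cubic link are arbitrary choices for the example.

    import numpy as np

    def sir_directions(X, y, n_slices=10, n_dirs=1):
        # Standardize predictors: Z = (X - mean) Sigma^{-1/2}.
        n, p = X.shape
        Xc = X - X.mean(axis=0)
        evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
        inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
        Z = Xc @ inv_sqrt
        # Slice observations by the order statistics of the response.
        slices = np.array_split(np.argsort(y), n_slices)
        # Weighted covariance of the within-slice means of Z.
        M = np.zeros((p, p))
        for idx in slices:
            m = Z[idx].mean(axis=0)
            M += (len(idx) / n) * np.outer(m, m)
        # Leading eigenvectors of M, mapped back to the original X scale.
        w, v = np.linalg.eigh(M)
        dirs = inv_sqrt @ v[:, ::-1][:, :n_dirs]
        return dirs / np.linalg.norm(dirs, axis=0)

    # Toy single-index example: y = (x1 + x2)^3 + noise, 10 predictors.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((2000, 10))
    y = (X[:, 0] + X[:, 1]) ** 3 + 0.5 * rng.standard_normal(2000)
    print(np.round(sir_directions(X, y).ravel(), 2))  # approx +/-(0.71, 0.71, 0, ..., 0)

Because the link $f$ here is unspecified and nonlinear, a linear forward model would mis-specify the mean function, while the inverse regression curve $E[X \mid Y]$ still lies in the span of the true direction $(1, 1, 0, \ldots, 0)/\sqrt{2}$, which SIR recovers up to sign.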
