Ultrahigh Dimensional Feature Selection: Beyond the Linear Model

Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently used techniques are based on independence screening; examples include correlation ranking (Fan and Lv, 2008) and feature selection using a two-sample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan and Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions, and that its revision, called iterative sure independence screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without the explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the widely used two-sample t-test method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology.
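To make the marginal screening step concrete, below is a minimal sketch of the two rankings named in the abstract: correlation ranking for regression and two-sample t-statistic ranking for binary classification. The function names and the screening size d = floor(n / log n) are illustrative choices (a cutoff of this order is suggested by Fan and Lv, 2008), not the paper's exact prescription; the iterative ISIS refinement, feature deletion, and the false-selection-rate control are not shown.

```python
# Sketch of vanilla marginal (independence) screening, assuming
# standardized inputs; `sis_screen` and `ttest_screen` are illustrative names.
import numpy as np

def sis_screen(X, y, d):
    """Keep the d features with largest absolute marginal correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    return np.argsort(-np.abs(corr))[:d]

def ttest_screen(X, y, d):
    """Keep the d features with largest two-sample t-statistic (y in {0, 1})."""
    g0, g1 = X[y == 0], X[y == 1]
    num = g1.mean(axis=0) - g0.mean(axis=0)
    den = np.sqrt(g1.var(axis=0, ddof=1) / len(g1) + g0.var(axis=0, ddof=1) / len(g0))
    return np.argsort(-np.abs(num / (den + 1e-12)))[:d]

# Toy usage: n = 100 samples, p = 5000 features, 3 truly active features.
rng = np.random.default_rng(0)
n, p = 100, 5000
X = rng.standard_normal((n, p))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.standard_normal(n)
d = int(n / np.log(n))
print(sis_screen(X, y, d)[:10])  # active features 0, 1, 2 should rank highly
```

Screening only reduces the feature space from p to d; in the methodology above it would be followed by a penalized (pseudo-)likelihood fit on the d retained features, iterated in the ISIS fashion.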

[1] M. M. Siddiqui et al. Robust Estimation of Location, 1967.

[2] H. Akaike. A new look at the statistical model identification, 1974.

[3] G. Schwarz. Estimating the Dimension of a Model, 1978.

[4] P. McCullagh and J. A. Nelder. Generalized Linear Models, 1992.

[5] I. Kononenko. Estimating Attributes: Analysis and Extensions of RELIEF. ECML, 1994.

[6] H. Almuallim and T. G. Dietterich. Learning Boolean Concepts in the Presence of Many Irrelevant Features. Artificial Intelligence, 1994.

[7] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

[8] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. EuroCOLT, 1995.

[9] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[10] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. The Springer International Series in Engineering and Computer Science, 1998.

[11] M. A. Hall. Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning. ICML, 1999.

[12] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2001.

[13] J. Fan and R. Li. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, 2001.

[14] M. Ringnér et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001.

[15] A. Antoniadis and J. Fan. Regularization of Wavelet Approximations, 2001.

[16] D. Donoho et al. Maximal Sparsity Representation via ℓ1 Minimization, 2002.

[17] N. Chapados et al. Extensions to Metric-Based Model Selection. JMLR, 2003.

[18] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu. Class Prediction by Nearest Shrunken Centroids, with Applications to DNA Microarrays, 2003.

[19] S. Dudoit et al. Multiple Hypothesis Testing in Microarray Experiments, 2003.

[20] L. Yu and H. Liu. Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution. ICML, 2003.

[21] Y. Lin et al. Multicategory Support Vector Machines: Theory and Application to the Classification of ..., 2003.

[22] I. Guyon and A. Elisseeff. An Introduction to Variable and Feature Selection. JMLR, 2003.

[23] J. Bi et al. Dimensionality Reduction via Sparse Support Vector Machines. JMLR, 2003.

[24] J. Fan and H. Peng. Nonconcave penalized likelihood with a diverging number of parameters, 2004. arXiv:math/0406466.

[25] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression, 2004. arXiv:math/0406456.

[26] D. Madigan et al. Discussion of "Least Angle Regression", 2004.

[27] B. Turlach. Discussion of "Least Angle Regression" by Efron, Hastie, Johnstone and Tibshirani, 2004.

[28] Y. Liu et al. Multicategory ψ-Learning and Support Vector Machine: Computational Tools, 2005.

[29] M. Y. Park and T. Hastie. L1-regularization path algorithm for generalized linear models, 2007.

[30] P. Warnat et al. Customized oligonucleotide microarray gene expression-based classification of neuroblastoma patients outperforms current clinical risk stratification. Journal of Clinical Oncology, 2006.

[31] M. Nikravesh et al. Feature Extraction: Foundations and Applications. Studies in Fuzziness and Soft Computing, 2006.

[32] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space, 2008. arXiv:math/0612857.

[33] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso, 2006. arXiv:math/0608017.

[34] E. Bair, T. Hastie, D. Paul, and R. Tibshirani. Prediction by Supervised Principal Components, 2006.

[35] H. Zou. The Adaptive Lasso and Its Oracle Properties, 2006.

[36] P. Zhao and B. Yu. On Model Selection Consistency of Lasso. JMLR, 2006.

[37] E. Candès and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n, 2007. arXiv:math/0506081.

[38] C.-H. Zhang. Penalized Linear Unbiased Selection, 2007.

[39] D. Paul, E. Bair, T. Hastie, and R. Tibshirani. "Preconditioning" for feature selection and regression in high-dimensional problems, 2007. arXiv:math/0703858.

[40] H. Liu et al. Searching for Interacting Features. IJCAI, 2007.

[41] H. Zou and R. Li. One-step Sparse Estimates in Nonconcave Penalized Likelihood Models. Annals of Statistics, 2008.

[42] J. Fan and Y. Fan. High Dimensional Classification Using Features Annealed Independence Rules. Annals of Statistics, 2008.

[43] C. Robert. Discussion of "Sure independence screening for ultra-high dimensional feature space" by Fan and Lv, 2008.

[44] B. Efron. Rejoinder: Microarrays, Empirical Bayes and the Two-Groups Model, 2008. arXiv:0808.0603.

[45] J. S. Morris et al. Discussion of "Sure independence screening for ultrahigh dimensional feature space", 2008.

[46] C.-H. Zhang and J. Huang. The sparsity and bias of the Lasso selection in high-dimensional linear regression, 2008. arXiv:0808.0967.

[47] P. Hall et al. Tilting methods for assessing the influence of components in a classifier, 2009.

[48] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous Analysis of Lasso and Dantzig Selector, 2009. arXiv:0801.1095.

[49] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty, 2010. arXiv:1002.4734.

[50] J. Fan and R. Song. Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 2010.