A Selective Overview of Variable Selection in High Dimensional Feature Space

High dimensional statistical problems arise from diverse fields of scientific research and technological development, and variable selection plays a pivotal role in contemporary statistical learning and scientific discovery. Traditional best subset selection, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood have been successfully developed over the last decade to cope with high dimensionality, and they have been widely applied to simultaneously select important variables and estimate their effects in high dimensional statistical inference. In this article, we present a brief account of recent developments in theory, methods, and implementations for high dimensional variable selection. Questions such as how high a dimensionality these methods can handle, what role the penalty function plays, and what statistical properties the resulting estimators enjoy rapidly drive the advances of the field. We emphasize the properties of nonconcave penalized likelihood and its roles in high dimensional statistical modeling, and we also review some recent advances in ultrahigh dimensional variable selection, with emphasis on independence screening and two-scale methods.
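To make the penalized likelihood framework concrete, the methods surveyed here maximize a log-likelihood minus a coefficientwise penalty; the smoothly clipped absolute deviation (SCAD) penalty is the standard nonconcave example. A minimal sketch of these textbook definitions, in generic notation:

$$
\hat{\beta} \;=\; \arg\max_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n}\,\ell_n(\beta) \;-\; \sum_{j=1}^{p} p_{\lambda}\big(|\beta_j|\big) \right\},
$$

where $\ell_n$ is the log-likelihood of the $n$ observations and $p_\lambda$ is a penalty function with regularization parameter $\lambda$. SCAD is defined through its derivative, for $t > 0$ and a constant $a > 2$ (often taken as $a = 3.7$):

$$
p'_{\lambda}(t) \;=\; \lambda \left\{ I(t \le \lambda) + \frac{(a\lambda - t)_{+}}{(a-1)\lambda}\, I(t > \lambda) \right\},
$$

so the penalty acts like the Lasso on small coefficients but levels off for large ones, which is what allows large effects to be estimated with little bias.

The independence screening step mentioned above is equally simple to state: rank the features by the magnitude of their marginal correlation with the response and keep only the top few, typically on the order of $n/\log n$ of them, before running a penalized method on the survivors. The Python function below is a minimal sketch of that screening step under these assumptions; the names `sis` and `n_keep` are ours, and the snippet is an illustration rather than the authors' implementation.

```python
import numpy as np

def sis(X, y, n_keep=None):
    """Sure-independence-screening sketch: rank features by absolute
    marginal correlation with y and return the top n_keep indices."""
    n, p = X.shape
    if n_keep is None:
        # A common default in the screening literature: keep ~ n / log(n) features.
        n_keep = int(n / np.log(n))
    # Standardize columns so marginal statistics are comparable across features.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    omega = np.abs(Xs.T @ ys) / n              # |sample marginal correlations|
    return np.argsort(omega)[-n_keep:][::-1]   # indices, strongest first

# Toy usage: two truly active features among p = 5000 with n = 200.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5000))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(200)
kept = sis(X, y)  # with high probability, columns 0 and 1 survive the screen
```

In a two-scale analysis, this crude large-scale screen reduces the problem from $p$ features to a moderate number, after which a fine-scale method such as SCAD-penalized likelihood or the Lasso performs the final selection and estimation.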
