A Selective Overview of Variable Selection in High Dimensional Feature Space

High dimensional statistical problems arise from diverse fields of scientific research and technological development, and variable selection plays a pivotal role in contemporary statistical learning and scientific discovery. Traditional best subset selection, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood have been successfully developed over the last decade to cope with high dimensionality, and they have been widely applied to simultaneously select important variables and estimate their effects in high dimensional statistical inference. In this article, we present a brief account of recent developments in theory, methods, and implementations for high dimensional variable selection. Questions such as how high a dimensionality these methods can handle, what role the penalty function plays, and what statistical properties the resulting estimators enjoy rapidly drive the advances of the field. We emphasize the properties of nonconcave penalized likelihood and its roles in high dimensional statistical modeling, and we also review some recent advances in ultrahigh dimensional variable selection, with emphasis on independence screening and two-scale methods.
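To make the penalized likelihood framework concrete, the methods surveyed here maximize a log-likelihood minus a coefficientwise penalty; the smoothly clipped absolute deviation (SCAD) penalty is the standard nonconcave example. A minimal sketch of these textbook definitions, in generic notation:

$$
\hat{\beta} \;=\; \arg\max_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n}\,\ell_n(\beta) \;-\; \sum_{j=1}^{p} p_{\lambda}\big(|\beta_j|\big) \right\},
$$

where $\ell_n$ is the log-likelihood of the $n$ observations and $p_\lambda$ is a penalty function with regularization parameter $\lambda$. SCAD is defined through its derivative, for $t > 0$ and a constant $a > 2$ (often taken as $a = 3.7$):

$$
p'_{\lambda}(t) \;=\; \lambda \left\{ I(t \le \lambda) + \frac{(a\lambda - t)_{+}}{(a-1)\lambda}\, I(t > \lambda) \right\},
$$

so the penalty acts like the Lasso on small coefficients but levels off for large ones, which is what allows large effects to be estimated with little bias.

The independence screening step mentioned above is equally simple to state: rank the features by the magnitude of their marginal correlation with the response and keep only the top few, typically on the order of $n/\log n$ of them, before running a penalized method on the survivors. The Python function below is a minimal sketch of that screening step under these assumptions; the names `sis` and `n_keep` are ours, and the snippet is an illustration rather than the authors' implementation.

```python
import numpy as np

def sis(X, y, n_keep=None):
    """Sure-independence-screening sketch: rank features by absolute
    marginal correlation with y and return the top n_keep indices."""
    n, p = X.shape
    if n_keep is None:
        # A common default in the screening literature: keep ~ n / log(n) features.
        n_keep = int(n / np.log(n))
    # Standardize columns so marginal statistics are comparable across features.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    omega = np.abs(Xs.T @ ys) / n              # |sample marginal correlations|
    return np.argsort(omega)[-n_keep:][::-1]   # indices, strongest first

# Toy usage: two truly active features among p = 5000 with n = 200.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5000))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(200)
kept = sis(X, y)  # with high probability, columns 0 and 1 survive the screen
```

In a two-scale analysis, this crude large-scale screen reduces the problem from $p$ features to a moderate number, after which a fine-scale method such as SCAD-penalized likelihood or the Lasso performs the final selection and estimation.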
