Greedy and Relaxed Approximations to Model Selection: A Simulation Study

The Minimum Description Length (MDL) principle is an important tool for retrieving knowledge from data, as it embodies the scientific striving for simplicity in describing the relationship among variables. Because MDL and other model selection criteria penalize models according to their dimensionality, the estimation problem involves a combinatorial search over subsets of predictors and quickly becomes computationally cumbersome. Two standard approximation frameworks are convex relaxation and greedy algorithms. In this article, we perform extensive simulations comparing two algorithms for generating candidate models that mimic the best subsets of predictors of each size: Forward Stepwise and the Least Absolute Shrinkage and Selection Operator (LASSO). From the list of models produced by each method, we consider estimates chosen by two different model selection criteria, AICc and the generalized MDL criterion gMDL. The comparisons are made in terms of selection and prediction performance. For variable selection, we consider two metrics. On the number of selection errors, our results suggest that the combination Forward Stepwise+gMDL performs better across different sample sizes and sparsity regimes. On the second metric, the rate of true positives among the selected variables, LASSO+gMDL appears more appropriate for very small sample sizes, whereas Forward Stepwise+gMDL performs better once the sample size is at least as large as the number of factors being screened. Moreover, we find that, asymptotically, Zhao and Yu's (1) irrepresentability condition (index) has a larger impact on the selection performance of LASSO than on Forward Stepwise. With regard to prediction performance, LASSO+AICc yields good predictive models over a wide range of sample sizes and sparsity regimes. Finally, these simulation results reveal that a single method often cannot serve both selection and prediction purposes.
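
The two-stage design described above can be made concrete with a short sketch (not the authors' simulation code): candidate supports are produced by a greedy Forward Stepwise pass and by the LASSO regularization path (here via scikit-learn's lars_path), and each candidate is then scored with AICc and with a gMDL-style code length. The AICc and gMDL formulas below follow the standard corrected-AIC expression and the two-branch form usually attributed to Hansen and Yu, and the irrepresentability index is computed as the largest entry of |X_Sc' X_S (X_S' X_S)^(-1) sign(beta_S)|; all of these are assumptions of the sketch rather than the paper's exact definitions.

import numpy as np
from sklearn.linear_model import lars_path

# Simulated design: n observations, p candidate predictors, k_true active ones.
rng = np.random.default_rng(0)
n, p, k_true = 50, 20, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k_true] = 2.0
y = X @ beta + rng.standard_normal(n)

def rss(support):
    # Residual sum of squares of least squares restricted to `support`.
    if len(support) == 0:
        return float(y @ y)
    Xs = X[:, list(support)]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ coef
    return float(r @ r)

def aicc(support):
    # AICc for a Gaussian linear model; k counts the coefficients plus the noise variance.
    k = len(support) + 1
    return n * np.log(rss(support) / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

def gmdl(support):
    # gMDL code length in the two-branch form attributed here to Hansen and Yu;
    # the exact constants are an assumption of this sketch.
    k = len(support)
    r = rss(support)
    if k == 0 or 1.0 - r / (y @ y) <= k / n:   # weak fit: encode with the null model
        return 0.5 * n * np.log((y @ y) / n) + 0.5 * np.log(n)
    S = r / (n - k)
    F = (y @ y - r) / (k * S)
    return 0.5 * n * np.log(S) + 0.5 * k * np.log(F) + np.log(n)

def forward_stepwise(max_k):
    # Greedy candidate supports: at each step add the predictor that most reduces the RSS.
    support, path = [], [tuple()]
    for _ in range(max_k):
        rest = [j for j in range(p) if j not in support]
        best = min(rest, key=lambda j: rss(support + [j]))
        support.append(best)
        path.append(tuple(support))
    return path

def lasso_candidates():
    # Candidate supports read off the LASSO regularization path (distinct active sets).
    _, _, coefs = lars_path(X, y, method="lasso")
    supports, seen = [], set()
    for col in coefs.T:
        s = tuple(int(j) for j in np.flatnonzero(col))
        if s not in seen:
            seen.add(s)
            supports.append(s)
    return supports

def irrepresentability_index(support):
    # max_j |X_Sc' X_S (X_S' X_S)^(-1) sign(beta_S)|_j, with sign(beta_S) = +1 here;
    # Zhao and Yu's strong condition asks for this quantity to stay below 1.
    S = list(support)
    Sc = [j for j in range(p) if j not in S]
    Xs, Xc = X[:, S], X[:, Sc]
    v = Xc.T @ Xs @ np.linalg.solve(Xs.T @ Xs, np.ones(len(S)))
    return float(np.max(np.abs(v)))

print("irrepresentability index of the true support:",
      round(irrepresentability_index(range(k_true)), 3))
for name, cands in [("Forward Stepwise", forward_stepwise(p)),
                    ("LASSO", lasso_candidates())]:
    for crit_name, crit in [("AICc", aicc), ("gMDL", gmdl)]:
        chosen = min(cands, key=crit)
        print(f"{name}+{crit_name}: selected predictors {sorted(chosen)}")

Keeping candidate generation separate from the final criterion mirrors the comparison in the abstract: each of the four combinations (Forward Stepwise or LASSO crossed with AICc or gMDL) amounts to choosing a different support from one of the two candidate lists, while the irrepresentability index summarizes how strongly the inactive predictors are correlated with the active ones.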

[1] Seymour Geisser et al., The Predictive Sample Reuse Method with Applications, 1975.

[2] Ronald A. DeVore et al., Some remarks on greedy algorithms, 1996, Adv. Comput. Math.

[3] N. Sugiura, Further analysis of the data by Akaike's information criterion and the finite corrections, 1978.

[4] David M. Allen et al., The Relationship Between Variable Selection and Data Augmentation and a Method for Prediction, 1974.

[5] Yoav Freund et al., Experiments with a New Boosting Algorithm, 1996, ICML.

[6] Michael Elad et al., Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization, 2003, Proceedings of the National Academy of Sciences of the United States of America.

[7] H. Zou, The Adaptive Lasso and Its Oracle Properties, 2006.

[8] L. Breiman, Better subset regression using the nonnegative garrote, 1995.

[9] Michael Elad et al., A generalized uncertainty principle and sparse representation in pairs of bases, 2002, IEEE Trans. Inf. Theory.

[10] Boris Polyak et al., Asymptotic Optimality of the $C_p$-Test for the Orthogonal Series Estimation of Regression, 1991.

[11] Yuhong Yang et al., Information-theoretic determination of minimax rates of convergence, 1999.

[12] I. Johnstone et al., Ideal spatial adaptation by wavelet shrinkage, 1994.

[13] L. Breiman, Heuristics of instability and stabilization in model selection, 1996.

[14] Terence Tao et al., The Dantzig selector: Statistical estimation when p is much larger than n, 2005, math/0506081.

[15] B. G. Quinn et al., The determination of the order of an autoregression, 1979.

[16] Peter Buhlmann, Boosting for high-dimensional linear models, 2006, math/0606789.

[17] J. Rissanen, Stochastic Complexity in Statistical Inquiry, 1989.

[18] D. Donoho, For most large underdetermined systems of equations, the minimal ℓ1-norm near-solution approximates the sparsest near-solution, 2006.

[19] R. Nishii, Asymptotic Properties of Criteria for Selection of Variables in Multiple Regression, 1984.

[20] N. Meinshausen et al., High-dimensional graphs and variable selection with the Lasso, 2006, math/0608017.

[21] Jean-Jacques Fuchs et al., On sparse representations in arbitrary redundant bases, 2004, IEEE Transactions on Information Theory.

[22] Thomas C.M. Lee et al., Information and Complexity in Statistical Modeling, 2008.

[23] M. R. Osborne et al., On the LASSO and its Dual, 2000.

[24] R. Tibshirani et al., On the “degrees of freedom” of the lasso, 2007, 0712.0881.

[25] R. Tibshirani, Regression Shrinkage and Selection via the Lasso, 1996.

[26] Xiaoming Huo et al., Uncertainty principles and ideal atomic decomposition, 2001, IEEE Trans. Inf. Theory.

[27] Martin J. Wainwright et al., Sharp thresholds for high-dimensional and noisy recovery of sparsity, 2006, ArXiv.

[28] J. Rissanen, Modeling by Shortest Data Description, 1978, Autom.

[29] J. Tropp, Signal Recovery from Partial Information via Orthogonal Matching Pursuit, 2005.

[30] Clifford M. Hurvich et al., Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion, 1998.

[31] G. Schwarz, Estimating the Dimension of a Model, 1978.

[32] R. Shibata, An optimal selection of regression variables, 1981.

[33] J. Tropp, Recovery of short, complex linear combinations via ℓ1 minimization, 2005, IEEE Trans. Inf. Theory.

[34] Ker-Chau Li et al., Asymptotic Optimality for $C_p, C_L$, Cross-Validation and Generalized Cross-Validation: Discrete Index Set, 1987.

[35] Stephen P. Boyd et al., Convex Optimization, 2004, Algorithms and Theory of Computation Handbook.

[36] Colin L. Mallows et al., Some Comments on C_p, 2000, Technometrics.

[37] Malik Beshir Malik et al., Applied Linear Regression, 2005, Technometrics.

[38] R. Shibata, Asymptotic mean efficiency of a selection of regression variables, 1983.

[39] Yuhong Yang, Can the Strengths of AIC and BIC Be Shared?, 2005.

[40] P. Zhao et al., Grouped and Hierarchical Model Selection through Composite Absolute Penalties, 2007.

[41] Joel A. Tropp et al., Greed is good: algorithmic results for sparse approximation, 2004, IEEE Transactions on Information Theory.

[42] Eric R. Ziegel et al., Generalized Linear Models, 2002, Technometrics.

[43] C. L. Mallows, Some comments on C_p, 1973.

[44] J. Shao, An Asymptotic Theory for Linear Model Selection, 1997.

[45] H. Akaike et al., Information Theory and an Extension of the Maximum Likelihood Principle, 1973.

[46] Peng Zhao et al., On Model Selection Consistency of Lasso, 2006, J. Mach. Learn. Res.

[47] M. Stone, Cross-Validatory Choice and Assessment of Statistical Predictions, 1976.

[48] C. S. Wallace et al., An Information Measure for Classification, 1968, Comput. J.

[49] Bin Yu et al., Model Selection and the Principle of Minimum Description Length, 2001.

[50] X. Huo et al., When do stepwise algorithms meet subset selection criteria, 2007, 0708.2149.

[51] Michael A. Saunders et al., Atomic Decomposition by Basis Pursuit, 1998, SIAM J. Sci. Comput.