Using random subspace method for prediction and variable importance assessment in linear regression

A random subspace method (RSM) with a new weighting scheme is proposed and investigated for linear regression with a large number of features. The weight of a variable is defined as the average of the squared values of its t-statistics over models fitted to randomly chosen subsets of the features. It is argued that such weighting is advisable because it incorporates two factors: a measure of the variable's importance within the considered model and a measure of the goodness-of-fit of the model itself. The asymptotic weights assigned by this scheme are determined, as are assumptions under which the method leads to a consistent choice of the significant variables in the model. Numerical experiments indicate that the proposed method behaves promisingly: its prediction errors are comparable to those of penalty-based methods such as the lasso, and it has a much smaller false discovery rate than the other methods considered.
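To make the weighting scheme concrete, the following is a minimal Python sketch of the idea described above: ordinary least squares is fitted repeatedly on randomly drawn feature subsets, and each variable's weight is the average of its squared t-statistics over the models in which it appears. The function name rsm_weights and the defaults n_models=500 and subspace_size=5 are illustrative assumptions for this sketch, not values taken from the paper.

```python
import numpy as np

def rsm_weights(X, y, n_models=500, subspace_size=5, seed=None):
    """Sketch of RSM variable weights: for each variable, average the
    squared t-statistic it receives across OLS fits on random feature
    subsets. Assumes subspace_size + 1 < n so each fit has positive
    residual degrees of freedom."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sums = np.zeros(p)    # running sum of squared t-statistics per variable
    counts = np.zeros(p)  # number of models each variable appeared in
    for _ in range(n_models):
        idx = rng.choice(p, size=subspace_size, replace=False)
        Xs = np.column_stack([np.ones(n), X[:, idx]])  # add intercept
        XtX_inv = np.linalg.inv(Xs.T @ Xs)
        beta = XtX_inv @ Xs.T @ y                      # OLS coefficients
        resid = y - Xs @ beta
        df = n - Xs.shape[1]
        sigma2 = resid @ resid / df                    # residual variance
        se = np.sqrt(sigma2 * np.diag(XtX_inv))        # standard errors
        t = beta / se
        sums[idx] += t[1:] ** 2                        # skip the intercept
        counts[idx] += 1
    # Average over the models in which each variable was drawn.
    return np.divide(sums, counts, out=np.zeros(p), where=counts > 0)
```

Variables can then be ranked by their weights and a final model refitted on the top-ranked ones, mirroring the selection step whose consistency the abstract discusses; how the cut-off is chosen is not specified here.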
