PBoostGA: pseudo-boosting genetic algorithm for variable ranking and selection

Variable selection has long been a central topic for linear regression models, especially when dealing with high-dimensional data. Variable ranking, an advanced form of selection, is in fact more fundamental, since selection can be realized by thresholding once the variables are suitably ranked. In recent years, ensemble learning has attracted significant interest in the context of variable selection owing to its potential to improve selection accuracy and to reduce the risk of falsely including unimportant variables. Motivated by the widespread success of boosting algorithms, this paper develops a novel ensemble method, PBoostGA, for variable ranking and selection in linear regression models. PBoostGA maintains a weight distribution over the training set and adopts a genetic algorithm as its base learner. Initially, each instance receives equal weight. Following a weight-updating and ensemble-generation mechanism akin to that of AdaBoost.RT, a sequence of slightly different importance measures is produced for each variable. Finally, the candidate variables are ordered by their average importance measure, and significant variables are selected via a thresholding rule. Both simulation results and a real-data illustration demonstrate the effectiveness of PBoostGA relative to several existing counterparts; in particular, PBoostGA shows a stronger ability to exclude redundant variables.
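
To make the procedure concrete, below is a minimal Python sketch of the AdaBoost.RT-style loop the abstract describes, not the authors' implementation. The GA base learner is replaced by a hypothetical random-subset stand-in (`ga_variable_importance`), and the error threshold `phi`, the power `power`, and the final thresholding rule are illustrative assumptions rather than the paper's exact choices.

```python
# Sketch of a PBoostGA-like ensemble loop (assumptions: phi, power,
# thresholding rule, and the stand-in base learner are illustrative).
import numpy as np

def weighted_lstsq(X, y, w):
    """Weighted least-squares fit; returns the coefficient vector."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

def ga_variable_importance(X, y, w, rng):
    """Hypothetical stand-in for the GA base learner: a crude random
    subset search. PBoostGA itself evolves subsets with a genetic
    algorithm; only the 0/1 importance output is mimicked here."""
    n, p = X.shape
    best_rss, best_mask = np.inf, np.zeros(p, dtype=bool)
    for _ in range(50):
        mask = rng.random(p) < 0.5
        if not mask.any():
            continue
        beta = weighted_lstsq(X[:, mask], y, w)
        rss = np.sum(w * (y - X[:, mask] @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_mask = rss, mask
    if not best_mask.any():
        best_mask[:] = True
    return best_mask.astype(float)  # per-variable importance in {0, 1}

def pboostga_sketch(X, y, n_rounds=20, phi=0.1, power=2, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.full(n, 1.0 / n)          # equal initial instance weights
    importances = []
    for _ in range(n_rounds):
        imp = ga_variable_importance(X, y, w, rng)
        importances.append(imp)
        # Refit on the selected variables, compute relative errors
        mask = imp > 0
        beta = weighted_lstsq(X[:, mask], y, w)
        rel_err = np.abs(y - X[:, mask] @ beta) / np.maximum(np.abs(y), 1e-12)
        hard = rel_err > phi         # AdaBoost.RT: "hard" instances
        eps = w[hard].sum()
        if eps <= 0 or eps >= 1:
            break
        beta_t = eps ** power        # AdaBoost.RT weight factor in (0, 1)
        w[~hard] *= beta_t           # downweight easy instances
        w /= w.sum()                 # renormalize to a distribution
    avg_imp = np.mean(importances, axis=0)
    selected = np.where(avg_imp > avg_imp.mean())[0]  # illustrative rule
    return avg_imp, selected
```

The key design point the sketch preserves is that, as in AdaBoost.RT, instances whose relative fitting error exceeds `phi` keep their weight while the rest are downweighted, so later GA runs concentrate on the hard-to-fit instances; ranking then averages the per-round importance measures.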
