Stress test procedure for feature selection algorithms

Abstract. This study investigates the multicollinearity problem and the performance of feature selection methods on data sets with multicollinear features. We propose a stress test procedure for a set of feature selection methods. The procedure generates test data sets with various configurations of the target vector and features, and thus enables a more thorough investigation of feature selection methods than previously described procedures. A number of multicollinear features is inserted into every configuration. For a given test data set, a feature selection method returns a set of selected features. To compare the feature selection methods, the procedure uses several quality measures. We propose a criterion of selected-feature redundancy: it estimates the number of multicollinear features among the selected ones, detecting multicollinearity via the eigensystem of the parameter covariance matrix. In computational experiments we consider the following illustrative methods: Lasso, ElasticNet, LARS, Ridge, Stepwise, and Genetic algorithms, and determine the best one for solving the multicollinearity problem in every considered data set configuration.
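The abstract does not give the data generator or the redundancy criterion in closed form. The sketch below, in Python with NumPy and scikit-learn, illustrates one plausible reading: test data are built by appending noisy copies of informative features, and redundancy is estimated by counting near-zero eigenvalues of the covariance matrix of the selected features. The function names `make_multicollinear_data` and `redundancy`, and the tolerance `tol`, are illustrative assumptions, not the paper's definitions.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)

def make_multicollinear_data(n=200, n_informative=5, n_collinear=5, noise=0.05):
    """One test configuration (assumed form): informative features plus
    noisy near-duplicates of them, i.e. multicollinear features."""
    X_inf = rng.standard_normal((n, n_informative))
    # Each collinear feature is a noisy copy of a random informative one.
    idx = rng.integers(0, n_informative, size=n_collinear)
    X_col = X_inf[:, idx] + noise * rng.standard_normal((n, n_collinear))
    X = np.hstack([X_inf, X_col])
    w = rng.standard_normal(n_informative)
    y = X_inf @ w + 0.1 * rng.standard_normal(n)
    return X, y

def redundancy(X_selected, tol=1e-2):
    """Assumed redundancy criterion: the number of near-zero eigenvalues
    of the covariance matrix of the selected features, which bounds the
    number of (near-)linearly dependent features among them."""
    if X_selected.shape[1] == 0:
        return 0
    cov = np.atleast_2d(np.cov(X_selected, rowvar=False))
    eigvals = np.linalg.eigvalsh(cov)
    return int(np.sum(eigvals < tol * eigvals.max()))

X, y = make_multicollinear_data()
for name, model in [("Lasso", Lasso(alpha=0.05)),
                    ("ElasticNet", ElasticNet(alpha=0.05, l1_ratio=0.5))]:
    model.fit(X, y)
    selected = np.flatnonzero(np.abs(model.coef_) > 1e-8)
    print(name, "selected:", selected,
          "redundancy:", redundancy(X[:, selected]))
```

Under this reading, a lower redundancy value at comparable prediction quality indicates a method that copes better with multicollinearity in that configuration.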
