Reshaped Sequential Replacement algorithm: an efficient approach to variable selection

Abstract: A modified version of the Sequential Replacement (SR) algorithm for variable selection is proposed, featuring additional functions aimed at: 1) reducing the computational time; 2) estimating the true predictive ability of the model; 3) identifying models suffering from pathologies. This redesigned version is called the Reshaped Sequential Replacement (RSR) algorithm. The RSR algorithm was applied to several regression and classification datasets and was compared with the original SR method by means of a Design of Experiments (DoE). The DoE took into account the functions that affect the outcome of the search in terms of the combinations of variables generated and the time required for computation. The results were also compared with published models on the same datasets, which were obtained by different variable selection methods and taken as reference. This latter comparison showed that the RSR algorithm found good subsets of variables on all datasets, even though the reference models were not always found. When the reference model was not found, the RSR algorithm returned subsets of variables that were comparable or better when evaluated in cross-validation. The DoE showed that including the additional functions made it possible to obtain models with equivalent or better performance in less computational time than the original SR method.
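As an illustration of the kind of search the abstract refers to, the following is a minimal sketch of a basic sequential replacement loop for regression. It is not the authors' SR or RSR implementation: the function names (`q2_loo`, `sequential_replacement`), the use of ordinary least squares, the leave-one-out Q2 defined as 1 - PRESS/TSS as the fitness function, and the stopping rule are all assumptions made for this example, and none of the additional RSR functions are included.

```python
import numpy as np


def q2_loo(X, y):
    """Leave-one-out Q2 (1 - PRESS/TSS) for OLS with an intercept.

    Assumed fitness function for this sketch; the paper may use another.
    """
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    # Hat matrix: its diagonal yields leave-one-out residuals without refitting.
    H = A @ np.linalg.pinv(A)
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    return 1.0 - np.sum(loo_residuals ** 2) / np.sum((y - y.mean()) ** 2)


def sequential_replacement(X, y, seed_subset, score=q2_loo):
    """Replace one variable at a time while the score keeps improving.

    In each pass, every selected variable is tentatively swapped with every
    excluded variable; the single best swap is kept. The search stops when
    no swap improves the score.
    """
    subset = list(seed_subset)
    best = score(X[:, subset], y)
    while True:
        best_swap = None
        for pos in range(len(subset)):
            for j in range(X.shape[1]):
                if j in subset:
                    continue
                trial = subset.copy()
                trial[pos] = j
                s = score(X[:, trial], y)
                if s > best + 1e-12:  # small tolerance against float noise
                    best, best_swap = s, trial
        if best_swap is None:
            return sorted(subset), best
        subset = best_swap
```

Calling `sequential_replacement(X, y, seed_subset=[1, 2])` on a data matrix `X` and response `y` returns the selected column indices and their cross-validated score. Because the search only accepts improving swaps from a single seed, it can stop at a local optimum, so the result depends on the seed subset.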
