Reshaped Sequential Replacement for variable selection in QSPR: comparison with other reference methods

The objective of the present work was to compare the Reshaped Sequential Replacement (RSR) algorithm with other well‐known variable selection techniques in the field of Quantitative Structure–Property Relationship (QSPR) modelling. The RSR algorithm is based on a simple sequential replacement procedure with the addition of several ‘reshaping’ functions that aim to (i) ensure faster convergence on optimal subsets of variables and (ii) reject models affected by chance correlation, overfitting and other pathologies. Three reference variable selection methods were chosen for the comparison (stepwise forward selection, genetic algorithms and particle swarm optimization) in order to identify the benefits and drawbacks of RSR with respect to these methods. To this end, several QSPR datasets covering different physical–chemical properties and characterized by different objects/variables ratios were used to build ordinary least squares models; in addition, some well‐known (Y‐scrambling) and more recent (R‐based functions) statistical tools were used to analyse and compare the results. The study highlighted the good capability of RSR to find optimal subsets of variables in QSPR modelling, comparable to or better than those found by the other reference variable selection methods. Moreover, RSR proved to be faster than some of the analysed variable selection techniques, despite its extensive exploration of the variable space. Copyright © 2014 John Wiley & Sons, Ltd.
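To make the underlying idea concrete, the following is a minimal sketch of a plain sequential replacement search over ordinary least squares models, followed by a simple Y‐scrambling check. It is not the authors' RSR implementation: the ‘reshaping’ functions are omitted, the fitness is assumed to be the plain fitted R², and the function names (`r2_ols`, `sequential_replacement`, `y_scrambling`) and the synthetic dataset are hypothetical, introduced only for illustration.

```python
import numpy as np


def r2_ols(X, y, subset):
    """R^2 of an OLS model (with intercept) built on the selected columns."""
    A = np.column_stack([np.ones(len(y)), X[:, list(subset)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / tss


def sequential_replacement(X, y, start, score=r2_ols):
    """Swap one variable at a time, keeping the best swap for each position,
    and repeat until a full pass over the positions gives no improvement."""
    subset = list(start)
    best = score(X, y, subset)
    improved = True
    while improved:
        improved = False
        for pos in range(len(subset)):
            best_var = subset[pos]
            for cand in range(X.shape[1]):
                if cand in subset:
                    continue  # variable already in the model
                trial = subset.copy()
                trial[pos] = cand
                s = score(X, y, trial)
                if s > best + 1e-12:  # accept strict improvements only
                    best, best_var = s, cand
            if best_var != subset[pos]:
                subset[pos] = best_var
                improved = True
    return sorted(subset), best


def y_scrambling(X, y, subset, n_perm=100, seed=0):
    """R^2 values obtained after randomly permuting the response."""
    rng = np.random.default_rng(seed)
    return np.array(
        [r2_ols(X, rng.permutation(y), subset) for _ in range(n_perm)]
    )


# Synthetic data (an assumption, not a QSPR dataset from the study):
# 60 objects, 15 variables, response driven by variables 3, 7 and 11.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 15))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + X[:, 11] + 0.1 * rng.normal(size=60)

subset, r2 = sequential_replacement(X, y, start=[0, 1, 2])
scrambled = y_scrambling(X, y, subset)
print("selected variables:", subset)
print("R2 of selected model:", round(r2, 4))
print("mean R2 after Y-scrambling:", round(scrambled.mean(), 4))
```

A model whose R² stays far above the values obtained on scrambled responses is unlikely to be the result of chance correlation; a model whose R² is similar to the scrambled values should be treated with suspicion. In practice, a cross‐validated statistic would usually replace the fitted R² as the fitness, since the fitted R² never penalizes overfitting.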
