Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by LASSO

Variable selection is of crucial significance in QSAR modeling because it improves a model's predictive ability and reduces noise. Selecting the right variables is far more complicated than developing the predictive models themselves. In this study, eight continuous and categorical data sets were employed to explore the applicability of two distinct variable selection methods: random forests (RF) and the least absolute shrinkage and selection operator (LASSO). Variable selection was performed (1) by using recursive random forests to rule out the least important quarter of the descriptors at each iteration and (2) by using LASSO modeling with 10-fold inner cross-validation to tune its penalty parameter λ for each data set. Along with the usual statistical measures of model performance, we proposed the highest pairwise correlation rate, the average pairwise Pearson correlation coefficient, and the Tanimoto coefficient to evaluate the optimal variable subsets chosen by RF and LASSO more comprehensively. Results showed that variable selection allowed a substantial reduction of noisy descriptors (up to 96% with the RF method in this study) and clearly enhanced the models' predictive performance. Furthermore, random forests tended to gather important predictors without restricting their pairwise correlation, in contrast to LASSO. The mutual exclusion of highly correlated variables in LASSO modeling tends to skip important variables that are strongly related to the response endpoints and thus undermines the model's predictive performance. The optimal variables selected by RF showed little similarity to those selected by LASSO (e.g., the Tanimoto coefficients were smaller than 0.20 in seven of the eight data sets). We found that the differences between the RF and LASSO predictive performances resulted mainly from the variables selected by the two strategies rather than from the learning algorithms themselves. Our study showed that selecting the right variables matters more for modeling than the choice of learning algorithm. We hope that a standard procedure can be developed from these proposed statistical metrics to select the truly important variables for model interpretation, as well as for further use in drug discovery and environmental toxicity assessment.
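To make the two selection procedures concrete, the sketch below illustrates, under stated assumptions, how the recursive random-forest elimination and the cross-validated LASSO selection could be implemented. It is a minimal illustration rather than the authors' exact pipeline: the descriptor matrix X, the endpoint vector y, the 500-tree forest, the minimum subset size, and the outer 5-fold scoring are placeholders; only the per-iteration drop fraction (one quarter of the descriptors) and the 10-fold inner cross-validation for the LASSO penalty follow the description above.

```python
# Minimal sketch of the two variable selection strategies described above.
# Assumptions: numeric descriptor matrix X of shape (n_samples, n_descriptors)
# and a continuous endpoint y; forest size, CV folds, and the stopping size
# are placeholders, not the authors' settings.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score


def recursive_rf_selection(X, y, drop_fraction=0.25, min_features=8, cv=5):
    """Recursively drop the least important quarter of the descriptors and keep
    the subset with the best cross-validated R^2 seen along the way."""
    kept = np.arange(X.shape[1])                  # current descriptor indices
    best_score, best_subset = -np.inf, kept.copy()
    while kept.size > min_features:
        rf = RandomForestRegressor(n_estimators=500, random_state=0, n_jobs=-1)
        score = cross_val_score(rf, X[:, kept], y, cv=cv, scoring="r2").mean()
        if score > best_score:
            best_score, best_subset = score, kept.copy()
        rf.fit(X[:, kept], y)
        order = np.argsort(rf.feature_importances_)       # ascending importance
        n_drop = max(1, int(drop_fraction * kept.size))   # rule out ~25% per iteration
        kept = kept[order[n_drop:]]                       # keep the more important rest
    return best_subset, best_score


def lasso_selection(X, y, cv=10):
    """Tune the LASSO penalty by 10-fold inner cross-validation and keep the
    descriptors with non-zero coefficients."""
    lasso = LassoCV(cv=cv, max_iter=50000).fit(X, y)
    return np.flatnonzero(lasso.coef_), lasso
```

In practice the descriptors would typically be standardized before the LASSO fit, since the penalty is scale-sensitive; the random forest is insensitive to monotone rescaling of individual descriptors.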

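The comparison metrics named above can be sketched in the same way. The Tanimoto coefficient is taken here as the standard set-overlap (Jaccard) formula applied to the two selected descriptor index sets; the abstract does not define the highest pairwise correlation rate precisely, so reading it as the fraction of selected descriptor pairs whose absolute Pearson correlation exceeds a chosen threshold (0.90 below) is an assumption made for illustration only.

```python
# Hedged sketch of the evaluation metrics for the selected descriptor subsets.
# The 0.90 threshold and the "fraction of highly correlated pairs" reading of
# the highest pairwise correlation rate are assumptions, not the authors'
# published definition.
import numpy as np


def pairwise_correlation_stats(X, selected, threshold=0.90):
    """Average |Pearson r| over all pairs of selected descriptors, plus the
    fraction of pairs whose |r| exceeds the threshold (needs >= 2 descriptors)."""
    corr = np.corrcoef(X[:, selected], rowvar=False)
    iu = np.triu_indices_from(corr, k=1)            # unique descriptor pairs
    abs_r = np.abs(corr[iu])
    return abs_r.mean(), float(np.mean(abs_r > threshold))


def tanimoto(selected_a, selected_b):
    """Tanimoto (Jaccard) coefficient between two sets of selected descriptor indices."""
    a, b = set(selected_a), set(selected_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


# Usage with the selection functions sketched earlier (X, y assumed available):
# rf_vars, _ = recursive_rf_selection(X, y)
# lasso_vars, _ = lasso_selection(X, y)
# print(tanimoto(rf_vars, lasso_vars))
```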