Benchmarking Variable Selection in QSAR

Variable selection is important in QSAR modeling because it can improve model performance and transparency and reduce the computational cost of model fitting and prediction. Which variable selection methods perform well in QSAR settings is largely unknown. To address this question, we ran a total of 1728 benchmarking experiments to rigorously investigate how eight variable selection methods affect the predictive performance and transparency of random forest models fitted to seven QSAR datasets covering different endpoints, descriptor sets, types of response variables, and numbers of chemical compounds. The results show that univariate variable selection methods are suboptimal and that the number of variables in the benchmarked datasets can be reduced by about 60% without significant loss in model performance when using multivariate adaptive regression splines (MARS) and forward selection.
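As an illustration of one of the methods named above, the following is a minimal sketch of greedy forward selection. It is not the implementation used in the benchmark. The function name `forward_selection`, the `min_gain` stopping threshold, and the caller-supplied `score` function are all assumptions made for this example; in a setting like the one described, `score` would typically be a cross-validated performance estimate of a random forest fitted on the candidate subset, whereas here a toy additive score stands in for it.

```python
from typing import Callable, List, Sequence


def forward_selection(
    candidates: Sequence[int],
    score: Callable[[List[int]], float],
    min_gain: float = 1e-3,
) -> List[int]:
    """Greedily add the variable that most improves `score`.

    Stops when no remaining candidate improves the score by at least
    `min_gain` (a hypothetical stopping rule chosen for this sketch).
    """
    selected: List[int] = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every one-variable extension of the current subset.
        trials = [(score(selected + [v]), v) for v in remaining]
        trial_score, trial_var = max(trials, key=lambda t: t[0])
        if trial_score - best_score < min_gain:
            break  # no candidate helps enough; keep the current subset
        selected.append(trial_var)
        remaining.remove(trial_var)
        best_score = trial_score
    return selected


# Toy stand-in for a model score (assumption): only variables 0 and 2
# carry signal, and their contributions simply add up.
TOY_GAIN = {0: 0.5, 1: 0.0, 2: 0.3, 3: 0.0}


def toy_score(subset: List[int]) -> float:
    return sum(TOY_GAIN[v] for v in subset)


chosen = forward_selection(range(4), toy_score)
print(chosen)
```

With the toy score, the procedure picks variable 0 first, then variable 2, and stops because neither remaining variable improves the score. Passing the scorer as a function keeps the search independent of the model, so the same loop could wrap any learner and any performance metric.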
