Application of Breiman's Random Forest to Modeling Structure-Activity Relationships of Pharmaceutical Molecules

Leo Breiman’s Random Forest ensemble learning procedure is applied to the problem of Quantitative Structure-Activity Relationship (QSAR) modeling for pharmaceutical molecules. This entails using a quantitative description of a compound’s molecular structure to predict that compound’s biological activity as measured in an in vitro assay. Without any parameter tuning, Random Forest with default settings already performs as well as or better than three other prominent QSAR methods (Decision Tree, Partial Least Squares, and Support Vector Machine) on six publicly available data sets. In addition to reliable prediction accuracy, Random Forest provides variable importance measures, which can be used in a variable reduction wrapper algorithm. Comparisons among several such wrappers, and between Random Forest and Bagging, are presented.
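The idea of an importance-driven variable reduction wrapper can be illustrated with a minimal sketch. It is not the procedure evaluated in the paper. To stay dependency-free, a bagged ensemble of one-split regression trees (stumps) with random variable subsets stands in for a full Random Forest. Importance is measured as the increase in out-of-bag error after permuting a variable. The halving schedule, the synthetic data, and all function names (`fit_stump`, `oob_importance`, `reduce_variables`) are assumptions made for illustration.

```python
import random

def fit_stump(X, y, rows, features):
    """Fit a one-split regression tree on `rows`, trying only `features`."""
    best = None
    for j in features:
        thr = sorted(X[i][j] for i in rows)[len(rows) // 2]  # median split point
        left = [y[i] for i in rows if X[i][j] <= thr]
        right = [y[i] for i in rows if X[i][j] > thr]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, j, thr, ml, mr)
    return best[1:] if best else None  # (variable, threshold, left mean, right mean)

def mse(stump, X, y, rows, shuffled=None):
    """Mean squared error on `rows`; if `shuffled` is given, the split
    variable's values are taken from those rows instead (permutation)."""
    j, thr, ml, mr = stump
    total = 0.0
    for k, i in enumerate(rows):
        xj = X[shuffled[k]][j] if shuffled is not None else X[i][j]
        total += (y[i] - (ml if xj <= thr else mr)) ** 2
    return total / len(rows)

def oob_importance(X, y, features, rng, n_trees=200, mtry=2):
    """Average increase in out-of-bag MSE when a variable is permuted."""
    n = len(y)
    imp = {j: 0.0 for j in features}
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]       # out-of-bag rows
        tried = rng.sample(features, min(mtry, len(features)))
        stump = fit_stump(X, y, boot, tried)
        if not oob or stump is None:
            continue
        shuffled = oob[:]
        rng.shuffle(shuffled)
        # A stump uses one variable, so only that variable's permutation matters.
        gain = mse(stump, X, y, oob, shuffled) - mse(stump, X, y, oob)
        imp[stump[0]] += gain / n_trees
    return imp

def reduce_variables(X, y, n_keep, rng):
    """Wrapper: refit, rank by importance, keep the top half, repeat."""
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        imp = oob_importance(X, y, features, rng)
        k = max(n_keep, len(features) // 2)
        features = sorted(sorted(features, key=imp.get, reverse=True)[:k])
    return features

# Synthetic data (an assumption): only variables 0 and 1 drive the response.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(200)]
y = [3 * row[0] - 2 * row[1] + rng.gauss(0, 0.1) for row in X]
selected = reduce_variables(X, y, n_keep=2, rng=rng)
print(selected)
```

Importance is recomputed after each reduction step rather than ranked once, since the ranking can shift as variables are removed. Any error estimate used to choose the final subset size would need to come from data held outside the whole wrapper loop to avoid selection bias.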
