Variable Selection in Random Forest with Application to Quantitative Structure-Activity Relationship

A wrapper variable selection procedure is proposed for use with learning machines that generate a measure of variable importance, such as Random Forest. The procedure is based on iteratively removing low-ranking variables and assessing the learning machine performance by cross-validation. The procedure is applied with Random Forest to several QSAR modeling examples from drug discovery and development. It is shown that the non-recursive version of the procedure outperforms the recursive version, and that the default Random Forest mtry function is usually adequate. The paper concludes with some comments about performance assessment and the dangers of using Random Forest’s out-of-bag error estimate in a variable selection wrapper.
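The non-recursive form of such a wrapper can be illustrated with a minimal sketch. It is not the paper's implementation. To stay dependency-free, it replaces Random Forest with two stand-ins: an absolute-correlation importance score and a 1-nearest-neighbour classifier. The function names, the halving schedule, and the toy data are all assumptions made for illustration. Variables are ranked once per cross-validation fold, using only that fold's training data, and the held-out error is recorded for each nested subset of top-ranked variables.

```python
import random

def abs_corr_importance(X, y):
    """Stand-in importance measure (assumption): |Pearson r| of each column with y."""
    n = len(y)
    my = sum(y) / n
    syy = sum((b - my) ** 2 for b in y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mx = sum(col) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(col, y))
        sxx = sum((a - mx) ** 2 for a in col)
        scores.append(abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0)
    return scores

def one_nn_predict(X_train, y_train, X_test, cols):
    """Stand-in learner (assumption): 1-nearest neighbour on the chosen columns."""
    preds = []
    for row in X_test:
        nearest = min(
            range(len(X_train)),
            key=lambda i: sum((X_train[i][j] - row[j]) ** 2 for j in cols),
        )
        preds.append(y_train[nearest])
    return preds

def halving_sizes(p):
    """Subset sizes p, p//2, p//4, ..., 1 (the halving schedule is an assumption)."""
    sizes = []
    while p >= 1:
        sizes.append(p)
        p //= 2
    return sizes

def cv_variable_selection(X, y, importance, predict, k=5, seed=0):
    """Non-recursive wrapper: rank once per fold, then score nested subsets."""
    n, p = len(X), len(X[0])
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sizes = halving_sizes(p)
    wrong = {s: 0 for s in sizes}
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        X_te = [X[i] for i in test_idx]
        y_te = [y[i] for i in test_idx]
        # Rank using training data only, so the held-out fold never
        # influences which variables are kept.
        scores = importance(X_tr, y_tr)
        ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
        for s in sizes:
            preds = predict(X_tr, y_tr, X_te, ranked[:s])
            wrong[s] += sum(a != b for a, b in zip(preds, y_te))
    return {s: wrong[s] / n for s in sizes}

# Toy data (assumption): 60 samples, 8 variables, only the first two informative.
rng = random.Random(1)
y = [i % 2 for i in range(60)]
X = [[(3.0 * y[i] if j < 2 else 0.0) + rng.gauss(0, 1) for j in range(8)]
     for i in range(60)]

result = cv_variable_selection(X, y, abs_corr_importance, one_nn_predict)
for size, err in result.items():
    print(f"{size} variables: CV error {err:.3f}")
```

A recursive variant would recompute the importance scores after each removal step rather than once per fold. In either variant, ranking inside each fold keeps the reported error from being optimistically biased by the selection step itself.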
