A Model-Based Ensembling Approach for Developing QSARs

Ensemble methods have become popular for QSAR modeling, but most studies have assumed balanced data, with approximately equal numbers of active and inactive compounds. In practice, cheminformatics data are often far from balanced. We extend ensemble methods to handle imbalanced class membership and to assess model output more adequately. Building on this extension, we propose an ensemble method called MBEnsemble that automatically determines the appropriate tuning parameters to provide reliable predictions and maximize the F-measure. Results from multiple data sets demonstrate that the proposed ensemble technique works well on imbalanced data.
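To make the F-measure criterion concrete, the following is a minimal sketch, not the MBEnsemble procedure itself, which the abstract does not specify. It assumes that ensemble members produce activity scores in [0, 1], that these scores are combined by simple averaging, and that the only tuning parameter is the classification threshold. The function names `f_measure`, `average_scores`, and `best_threshold` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence, Tuple


def f_measure(labels: Sequence[int], predictions: Sequence[int], beta: float = 1.0) -> float:
    """F-measure for the active class (label 1): weighted harmonic mean of precision and recall."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_scores(member_scores: Sequence[Sequence[float]]) -> List[float]:
    """Combine ensemble members by averaging their per-compound scores (an assumed combiner)."""
    n_members = len(member_scores)
    return [sum(column) / n_members for column in zip(*member_scores)]


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Scan every observed score as a candidate cutoff and keep the one with the highest F-measure."""
    best_t, best_f = 0.5, -1.0
    for t in sorted(set(scores)):
        predictions = [1 if s >= t else 0 for s in scores]
        f = f_measure(labels, predictions)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Toy example: two actives among five compounds (illustrative values, not real data).
scores = [0.9, 0.4, 0.35, 0.1, 0.05]
labels = [1, 1, 0, 0, 0]
threshold, f_value = best_threshold(scores, labels)
```

On imbalanced data the default cutoff of 0.5 can miss actives that receive moderate scores, which is why the cutoff is treated here as a tunable quantity. Choosing it on the same compounds used to report the F-measure gives an optimistic estimate, so in practice the cutoff would be selected on held-out data or by cross-validation.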
