Practical Outcomes of Applying Ensemble Machine Learning Classifiers to High-Throughput Screening (HTS) Data Analysis and Screening

Over the years, numerous papers have reported the effectiveness of various machine learning methods in analyzing biological screening data from drug discovery. The predictive performance of models developed with these methods has traditionally been evaluated against a portion of the data randomly selected for holdout. In our experience, such assessments, while widely practiced, tend to be optimistic. This paper describes the development of a series of ensemble-based decision tree models, shares our experience at various stages of the model development process, and presents the impact of such models when they are applied to vendor offerings and the forecasted compounds are acquired and screened in the relevant assays. We have seen that well-developed models can significantly increase the hit rates observed in HTS campaigns.
