Assessing Different Classification Methods for Virtual Screening

How well do different classification methods perform in selecting the ligands of a protein target out of large compound collections not used to train the model? Support vector machines, random forests, artificial neural networks, k-nearest-neighbor classification with genetic-algorithm-optimized feature selection, trend vectors, naïve Bayesian classification, and decision trees were used to divide databases into molecules predicted to be active and molecules predicted to be inactive. Both training and predicted activities were treated as binary. The databases were built from the ligands of five biological targets that have been the object of intense drug discovery efforts: HIV reverse transcriptase, COX-2, dihydrofolate reductase, estrogen receptor, and thrombin. We report significant differences in the performance of the methods, independent of the biological target and compound class. Different methods suit different applications: some provide particularly high enrichment, while others are strong in retrieving the maximum number of actives. We also show that these methods do surprisingly well in predicting recently published ligands of a target on the basis of initial leads, and that in certain cases combining the results of different methods can improve on the results of the most consistent single method.
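The trade-off between enrichment and retrieval of actives can be made concrete with a small sketch. The abstract does not give the formulas or the combination scheme used in the study, so everything below is an assumption for illustration: enrichment is taken as the hit rate among predicted actives divided by the fraction of actives in the whole database, recall as the fraction of all actives that were retrieved, and the combination as a simple majority vote. The function names and the ten-compound toy data are hypothetical.

```python
def screening_metrics(is_active, predicted_active):
    """Return (enrichment, recall) for binary labels and binary predictions.

    Assumed definitions (not taken from the paper):
      enrichment = (hits / selected) / (actives / total)
      recall     = hits / actives
    """
    n_total = len(is_active)
    n_actives = sum(is_active)
    n_selected = sum(predicted_active)
    n_hits = sum(1 for a, p in zip(is_active, predicted_active) if a and p)

    hit_rate = n_hits / n_selected if n_selected else 0.0
    base_rate = n_actives / n_total if n_total else 0.0
    enrichment = hit_rate / base_rate if base_rate else 0.0
    recall = n_hits / n_actives if n_actives else 0.0
    return enrichment, recall


def majority_vote(predictions):
    """Combine binary predictions from several methods by strict majority.

    Majority voting is an assumed combination rule, chosen for simplicity.
    """
    n_methods = len(predictions)
    return [1 if 2 * sum(votes) > n_methods else 0 for votes in zip(*predictions)]


# Toy database: 10 compounds, the first 3 are true actives.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# Three hypothetical methods with different selection behavior.
method_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # selective: few picks, all correct
method_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # permissive: finds every active
method_c = [1, 1, 0, 0, 0, 0, 0, 1, 0, 0]  # in between

consensus = majority_vote([method_a, method_b, method_c])

for name, pred in [("A", method_a), ("B", method_b),
                   ("C", method_c), ("consensus", consensus)]:
    ef, rec = screening_metrics(labels, pred)
    print(f"{name}: enrichment={ef:.2f}, recall={rec:.2f}")
```

In this toy case, method A has the highest enrichment but retrieves only one of three actives, method B retrieves all actives at lower enrichment, and the majority vote keeps the high enrichment of A while retrieving more actives. This only illustrates how such a trade-off can arise; it says nothing about which of the methods in the study behave this way.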
