The effect of structural redundancy in validation sets on virtual screening performance

The performance of a classification model is often assessed in terms of how well it separates a set of known observations into appropriate classes. If the validation sets used for such analyses are redundant because of sampling bias, the conclusions drawn may have limited relevance to prospective work in which new kinds of positives are sought. For the virtual screening techniques used in modern drug discovery, such bias generally appears as over‐representation of particular structural subclasses in the test set. We show how clustering by substructural similarity, followed by arithmetic and harmonic weighting of receiver operating characteristic (ROC) curves, can be used to identify validation sets that are biased by such redundancies. This can be done qualitatively by direct examination of the curves, or quantitatively by comparing the areas under the respective linear or semilog curves (AUCs or pAUCs). Copyright © 2009 John Wiley & Sons, Ltd.
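The comparison of unweighted and cluster-weighted ROC areas described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `weighted_roc_auc`, its parameters, and the exact weight definitions are assumptions made for illustration: here "arithmetic" weighting gives each active a weight of 1/(number of actives in its cluster), and "harmonic" weighting gives the k-th active retrieved from a cluster a weight of 1/k. Cluster labels are assumed to come from a prior substructural clustering step, tied scores are not treated specially, and only the linear AUC is computed (not the semilog pAUC).

```python
from collections import Counter, defaultdict


def weighted_roc_auc(scores, is_active, cluster_ids, scheme="uniform"):
    """Area under a (possibly cluster-weighted) ROC curve.

    scores      -- higher score means ranked earlier
    is_active   -- True for actives, False for decoys
    cluster_ids -- cluster label per compound (only used for actives)
    scheme      -- "uniform", "arithmetic" or "harmonic" (assumed definitions)
    """
    # Rank compounds from best to worst score.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])

    # Number of actives in each cluster, needed for arithmetic weighting.
    sizes = Counter(c for c, a in zip(cluster_ids, is_active) if a)

    seen = defaultdict(int)  # actives retrieved so far, per cluster
    weights = {}
    for i in order:
        if not is_active[i]:
            continue
        c = cluster_ids[i]
        seen[c] += 1
        if scheme == "uniform":
            weights[i] = 1.0
        elif scheme == "arithmetic":
            weights[i] = 1.0 / sizes[c]   # every cluster sums to 1
        elif scheme == "harmonic":
            weights[i] = 1.0 / seen[c]    # later cluster members count less
        else:
            raise ValueError(f"unknown scheme: {scheme}")

    total_weight = sum(weights.values())
    n_decoys = sum(1 for a in is_active if not a)
    if total_weight == 0 or n_decoys == 0:
        raise ValueError("need at least one active and one decoy")

    # Each decoy adds one step along the false-positive axis; the curve
    # height at that step is the weighted fraction of actives found so far.
    area = 0.0
    found = 0.0
    for i in order:
        if is_active[i]:
            found += weights[i]
        else:
            area += found / total_weight
    return area / n_decoys


# Hypothetical toy ranking: cluster "A" holds three actives, "B" holds one.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
is_active = [True, True, True, False, True, False]
cluster_ids = ["A", "A", "A", None, "B", None]

for scheme in ("uniform", "arithmetic", "harmonic"):
    print(scheme, weighted_roc_auc(scores, is_active, cluster_ids, scheme))
```

In this toy ranking the over-represented cluster is retrieved first, so both weighted areas fall below the unweighted one. A sizeable gap of that kind is the quantitative signal of redundancy that the abstract refers to.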
