Spatial Statistics Methods for the Analysis of Chemical Datasets in Virtual Screening Validation Experiments

Many reports evaluating virtual screening methods find that validation results vary considerably from one benchmark dataset to another. These effects are widely assumed to be caused by the redundancy and cluster structure inherent to those datasets. These phenomena manifest themselves in descriptor space as what is termed the dataset topology. A methodology for the characterization of dataset topology based on spatial statistics is introduced. With this methodology, differences in virtual screening performance on different datasets can be associated with differences in dataset topology. Moreover, the better virtual screening performance of certain descriptors can be explained by their ability to represent the benchmark datasets with a more favorable topology. It is shown that the composition of some benchmark datasets causes topologies that lead to over-optimistic validation results even in very "simple" descriptor spaces. Spatial statistics analysis as proposed here facilitates the detection of such biased datasets and provides a tool for the design of unbiased benchmark datasets. General principles for the design of benchmark datasets that are not affected by topological bias were developed. Refined Nearest Neighbor Analysis was used to design benchmark datasets based on PubChem bioactivity data. A workflow is devised that purges unselective hits from datasets of compounds active against pharmaceutically relevant targets. Topological optimization using experimental design strategies was applied to generate corresponding datasets of actives and decoys that are unbiased with regard to analogue bias and artificial enrichment. These datasets provide a tool for a Maximum Unbiased Validation (MUV) of virtual screening methods. The datasets and a MATLAB toolbox for spatial statistics are freely available on the enclosed CD-ROM or via the internet at http://www.pharmchem.tu-bs.de/lehre/baumann/MUV.html.
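The abstract does not spell out how nearest-neighbor statistics describe a dataset topology, so the following Python sketch illustrates the general idea only. It is not the authors' MATLAB toolbox, and the function names, the Euclidean metric, the toy coordinates, and the summary score are all assumptions made for illustration. It compares two cumulative nearest-neighbor distributions common in point-pattern analysis: the fraction of actives whose nearest other active lies within a distance t (called G here), and the fraction of decoys whose nearest active lies within t (called F here).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_distances(queries, targets, exclude_self=False):
    """Distance from each query point to its nearest target point.

    With exclude_self=True, queries and targets are taken to be the same
    list and each point is not compared with itself.
    """
    result = []
    for i, q in enumerate(queries):
        nearest = min(
            euclidean(q, t)
            for j, t in enumerate(targets)
            if not (exclude_self and i == j)
        )
        result.append(nearest)
    return result

def cumulative_fraction(distances, t):
    """Fraction of distances that are less than or equal to t."""
    return sum(d <= t for d in distances) / len(distances)

def topology_summary(actives, decoys, thresholds):
    """Return (sum of F minus sum of G, G values, F values).

    G: actives with another active within t (self-similarity of actives).
    F: decoys with an active within t (how well decoys embed the actives).
    The difference is a hypothetical summary score for this sketch.
    """
    g_dist = nn_distances(actives, actives, exclude_self=True)
    f_dist = nn_distances(decoys, actives)
    g = [cumulative_fraction(g_dist, t) for t in thresholds]
    f = [cumulative_fraction(f_dist, t) for t in thresholds]
    return sum(f) - sum(g), g, f

# Toy 2D "descriptor space"; the coordinates are made up for illustration.
actives = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
decoys = [(0.0, 3.0), (4.0, 5.0), (9.0, 9.0)]
score, g, f = topology_summary(actives, decoys, thresholds=[1.0, 2.0, 3.0])
print(score, g, f)
```

In this sketch a negative score means that actives lie closer to each other than decoys lie to actives, which is the kind of clustering among actives that the abstract associates with over-optimistic validation results. A score near zero would indicate that actives are embedded among the decoys about as tightly as they are among themselves.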
