How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space

Different molecular descriptors capture different aspects of molecular structure, but this effect has not yet been quantified systematically on a large scale. In this work, we calculate the similarity of 37 descriptors by repeatedly selecting query compounds and ranking the rest of the database with each descriptor. Euclidean distances between the rank orderings produced by different descriptors are calculated to determine descriptor (as opposed to compound) similarity, followed by principal component analysis (PCA) for visualization. Four broad descriptor classes are identified: circular fingerprints; circular fingerprints considering counts; path-based and keyed fingerprints; and pharmacophoric descriptors. Descriptor behavior is defined much more by these four classes than by the particular parametrization. Using counts instead of the presence/absence of fingerprint features significantly changes descriptor behavior, which is crucial for the performance of topological autocorrelation vectors but not of circular fingerprints. Surprisingly, four-point pharmacophores (piDAPH4) give much higher retrieval rates than three-point pharmacophores (28.21% vs 19.15%) while still producing similar rank orderings of compounds (that is, retrieving similar actives). Looking into individual rankings, circular fingerprints seem more appropriate than path-based fingerprints if complex ring systems or branching patterns are present; count-based fingerprints could be more suitable in databases with a large number of repeated subunits (amide bonds, sugar rings, terpenes). Information-based selection of diverse fingerprints for consensus scoring (ECFP4/TGD fingerprints) led only to marginal improvement over single-fingerprint results. While it seems to be nontrivial to exploit orthogonal descriptor behavior to improve retrieval rates in consensus virtual screening, these descriptors each still retrieve different actives, which corroborates the strategy of employing diverse descriptors individually in prospective virtual screening settings.
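The rank-based comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the random binary "fingerprints", the Tanimoto coefficient, the number of queries, and all function names are assumptions made for illustration, and PCA is applied here directly to the concatenated rank vectors via a singular value decomposition.

```python
import numpy as np

def tanimoto_to_query(fps, q):
    """Tanimoto similarity of every binary fingerprint (row) to row q."""
    inter = (fps & fps[q]).sum(axis=1)
    union = (fps | fps[q]).sum(axis=1)
    return np.where(union > 0, inter / np.maximum(union, 1), 0.0)

def rank_profile(fps, queries):
    """Concatenate, over all queries, the rank given to each remaining compound."""
    n = fps.shape[0]
    parts = []
    for q in queries:
        others = np.arange(n) != q                # exclude the query itself
        sim = tanimoto_to_query(fps, q)[others]
        order = np.argsort(-sim, kind="stable")   # most similar first
        ranks = np.empty(n - 1, dtype=float)
        ranks[order] = np.arange(n - 1)           # rank 0 = top of the list
        parts.append(ranks)
    return np.concatenate(parts)

def descriptor_distances(profiles):
    """Pairwise Euclidean distances between descriptor rank profiles."""
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def pca_scores(X, n_components=2):
    """Project the rows of X onto their leading principal components."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Hypothetical toy data: three "descriptors" for the same 200 compounds.
rng = np.random.default_rng(0)
base = rng.random((200, 256)) < 0.3
descriptors = {
    "A": base,                                    # reference descriptor
    "A_noisy": base ^ (rng.random(base.shape) < 0.01),  # A with 1% of bits flipped
    "B": rng.random((200, 256)) < 0.3,            # unrelated descriptor
}
queries = rng.choice(200, size=10, replace=False)

profiles = np.vstack([rank_profile(fp, queries) for fp in descriptors.values()])
dist = descriptor_distances(profiles)             # descriptor-by-descriptor matrix
scores = pca_scores(profiles)                     # 2D coordinates for plotting

print(np.round(dist, 1))
print(np.round(scores, 1))
```

In this toy setting, the two closely related descriptors should lie nearer to each other in rank space than either does to the unrelated one, which is the kind of grouping the PCA visualization is meant to reveal.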
