Molecular similarity measures.

Molecular similarity is a pervasive concept in chemistry. It is essential to many aspects of chemical reasoning and analysis and is perhaps the fundamental assumption underlying medicinal chemistry. Dissimilarity, the complement of similarity, also plays a major role in a growing number of applications of molecular diversity in combinatorial chemistry, high-throughput screening, and related fields. How molecular information is represented, the so-called representation problem, strongly influences the type of molecular similarity analysis (MSA) that can be carried out in any given situation. In this chapter, four types of mathematical structure are used to represent molecular information: sets, graphs, vectors, and functions. Molecular similarity is a pairwise relationship that induces structure on sets of molecules, giving rise to the concept of chemical space. Although all three concepts (molecular similarity, molecular representation, and chemical space) are treated in this chapter, the emphasis is on molecular similarity measures. Similarity measures, also called similarity coefficients or indices, are functions that map pairs of compatible molecular representations, that is, representations of the same mathematical form, into real numbers that usually, but not always, lie on the unit interval. This chapter presents a somewhat pedagogical discussion of many types of molecular similarity measures, their strengths and limitations, and their relationships to one another. It also expands on the account of chemical spaces given in the first edition of this book, including a discussion of the topography of activity landscapes and the role that activity cliffs in these landscapes play in structure-activity studies.
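The definition of a similarity measure as a function from pairs of compatible representations to real numbers can be illustrated with a small sketch. The abstract does not name any particular coefficient, so the two chosen here are assumptions made purely for illustration: the Tanimoto (Jaccard) coefficient for set representations, which lies on the unit interval, and the cosine coefficient for vector representations, which can fall outside it when components are negative. The feature labels and descriptor values are hypothetical.

```python
import math
from typing import FrozenSet, Sequence


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) coefficient for two feature sets; lies in [0, 1]."""
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


def cosine(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine coefficient for two descriptor vectors; lies in [-1, 1]."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    norm_x = math.sqrt(sum(v * v for v in x))
    norm_y = math.sqrt(sum(v * v for v in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return sum(p * q for p, q in zip(x, y)) / (norm_x * norm_y)


# Hypothetical set representations: each molecule is a set of feature labels.
mol_a = frozenset({"aromatic_ring", "hydroxyl", "carboxyl"})
mol_b = frozenset({"aromatic_ring", "hydroxyl", "amine"})

set_similarity = tanimoto(mol_a, mol_b)    # 2 shared features out of 4 distinct
set_dissimilarity = 1.0 - set_similarity   # complement of similarity

# Hypothetical vector representations with signed components.
vec_similarity = cosine([1.0, 0.0], [-1.0, 0.0])  # opposite directions
```

Both functions accept only pairs of the same mathematical form (two sets, or two vectors of equal length), which mirrors the compatibility requirement in the definition. The signed-vector case shows one way a measure can leave the unit interval.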
