Introduction to Similarity Searching in Chemistry

The similarity concept and its database implementation, similarity searching, are reviewed in the context of chemoinformatics. Similarity is defined in terms of matches (overlap) and dissimilarity in terms of mismatches (difference), for both qualitative and quantitative characteristics. Similarity, dissimilarity, and composite measures are constructed from similarity and/or dissimilarity components; asymmetric measures are constructed by weighting the dissimilarity components unequally. Either whole objects or local regions of them are compared, yielding global or local similarity respectively. Asymmetric local similarity is obtained by treating the objects in the comparison unequally, e.g. by ignoring parts of one of them. Global characteristics provide overall descriptions of objects, whereas local characteristics provide enough locational information for the objects to be aligned or superposed. Similar objects are likely to have similar properties (the similar property principle). In chemical similarity searching, the objects of interest may be molecules, fragments of molecules, reactions, mixtures, journal articles, etc. The selection of characteristics and their encoding is illustrated using the atom pair and topological torsion descriptors, as well as variants of these with increased fuzziness. The selection of a similarity measure is still very much a matter of trial and error. Standard query object specification is made easier by query by example; multiple searches using a single query yield a highly informative hyperlinked screen; and joint queries involve more than one object. Similarity scores are used to illustrate the results of similarity searches and measures of their effectiveness. Areas of application include direct and reverse property prediction, data mining, virtual screening, diversity analysis, pharmacophore searching, ligand docking, structure elucidation, pattern matching, and signature analysis.

* Dedicated to the memory of Professor Oscar E. Polansky.
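
As a rough illustration of how measures can be built from match and mismatch components, the sketch below counts shared and unshared features in two feature sets and combines them in two ways. The Tanimoto and Tversky coefficients are used here as familiar examples and are not named in the abstract; the function names, the weights `alpha` and `beta`, and the feature labels (written loosely in the spirit of atom pairs) are hypothetical and chosen only for this example.

```python
def match_counts(a, b):
    """Return (matches, features only in a, features only in b)."""
    a, b = set(a), set(b)
    return len(a & b), len(a - b), len(b - a)


def tanimoto(a, b):
    """Symmetric measure: both mismatch components carry equal weight."""
    common, only_a, only_b = match_counts(a, b)
    total = common + only_a + only_b
    # Two empty feature sets are treated as identical (an assumption).
    return common / total if total else 1.0


def tversky(a, b, alpha, beta):
    """Asymmetric measure: the two mismatch components are weighted unequally."""
    common, only_a, only_b = match_counts(a, b)
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 1.0


# Hypothetical feature labels; real descriptors would be generated from structures.
query = {"C-C-2", "C-N-3", "C-O-4", "N-O-5"}
target = {"C-C-2", "C-N-3", "C-O-4", "N-O-5", "C-Cl-6", "C-C-3"}

sym = tanimoto(query, target)
# With alpha=1 and beta=0, features found only in the second object are ignored.
query_in_target = tversky(query, target, alpha=1.0, beta=0.0)
target_in_query = tversky(target, query, alpha=1.0, beta=0.0)

print(f"Tanimoto(query, target)      = {sym:.3f}")
print(f"Tversky(query, target; 1, 0) = {query_in_target:.3f}")
print(f"Tversky(target, query; 1, 0) = {target_in_query:.3f}")
```

With these weights the score depends on the order of the arguments: every feature of `query` occurs in `target`, but not the other way round, which is one simple way of treating the two objects in a comparison unequally.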
