International Union of Pure and Applied Chemistry, LUMO energy: The Lowest Unoccupied Molecular Orbital (LUMO)

Although the concept of similarity is convenient for humans, a formal definition of similarity between chemical compounds is needed to enable automatic decision-making. The objective of similarity measures in toxicology and drug design is to allow assessment of chemical activities. The ideal similarity measure should be relevant to the activity of interest. This relevance could be established by exploiting knowledge about the fundamental chemical and biological processes responsible for the activity. Unfortunately, this knowledge is rarely available, and therefore different approximations have been developed based on similarity between structures or descriptor values. Various methods are reviewed, ranging from two-dimensional, three-dimensional and field approaches to recent methods based on “Atoms in Molecules” theory. All these methods attempt to describe chemical compounds by a set of numerical values and define some means of comparing them. The review analyzes potential pitfalls of this methodology: loss of information in the representations of molecular structures, and the relevance of a particular representation and chosen similarity measure to the activity. A brief review of known methods for descriptor selection is also provided. The popular “neighborhood behavior” principle is criticized, since proximity with respect to descriptors does not necessarily mean proximity with respect to activity. Structural similarity should also be used with care, since, as examples show, it does not always imply similar activity. We recall that similarity measures and classification techniques based on distances rely on certain assumptions about the data distribution. If these assumptions are not satisfied for a given dataset, the results could be misleading. Similarity in descriptor space is also discussed in the context of applicability domain assessment of QSAR models.
Finally, it is shown that descriptor-based similarity analysis is prone to errors if the relationship between the activity and the descriptors has not been previously established. For every specific activity, the use of a particular similarity measure should be justified either by expert knowledge or by data modeling techniques.
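
The dependence of descriptor-based similarity on the chosen representation can be illustrated with a minimal sketch. It is not taken from the review: the compound labels, the two descriptors (molecular weight and logP) and all numerical values are made up for illustration, and Euclidean distance with z-score scaling is only one of many possible choices. The sketch shows that the nearest neighbor of a query compound can change when the descriptors are rescaled, so proximity in descriptor space is not an intrinsic property of the compounds.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Scale each descriptor column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for a constant column to avoid division by zero.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def nearest(query_idx, rows):
    """Index of the row closest to rows[query_idx], excluding the query itself."""
    q = rows[query_idx]
    others = [i for i in range(len(rows)) if i != query_idx]
    return min(others, key=lambda i: euclidean(q, rows[i]))

# Hypothetical descriptor table (molecular weight, logP); values are invented.
compounds = ["A", "B", "C", "D"]
descriptors = [
    [300.0, 1.0],  # A: query compound
    [305.0, 4.0],  # B: close to A in weight, far in logP
    [330.0, 1.2],  # C: farther in weight, close in logP
    [400.0, 2.5],  # D: far from A in both descriptors
]

raw_nn = compounds[nearest(0, descriptors)]
scaled_nn = compounds[nearest(0, standardize(descriptors))]

print("nearest to A, raw descriptors:   ", raw_nn)
print("nearest to A, scaled descriptors:", scaled_nn)
```

With the raw values, the large numerical range of molecular weight dominates the distance and B is the nearest neighbor of A; after standardization, logP contributes comparably and C becomes the nearest neighbor. Which of the two answers is meaningful depends on how each descriptor relates to the activity of interest, which is the point made in the text above.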
