When is Chemical Similarity Significant? The Statistical Distribution of Chemical Similarity Scores and Its Extreme Values

As repositories of chemical molecules continue to expand and become more open, it becomes increasingly important to develop tools to search them efficiently and to assess the statistical significance of chemical similarity scores. Here, we develop a general framework for understanding, modeling, predicting, and approximating the distribution of chemical similarity scores and its extreme values in large databases. The framework can be applied to different chemical representations and similarity measures, but it is demonstrated here using the most common binary fingerprints with the Tanimoto similarity measure. After introducing several probabilistic models of fingerprints, including the Conditional Gaussian Uniform model, we show that the distribution of Tanimoto scores can be approximated by the distribution of the ratio of two correlated Normal random variables associated with the corresponding unions and intersections. This approximation also holds when the distribution of similarity scores is conditioned on the size of the query molecule, which yields more fine-grained results and improves chemical retrieval. The corresponding extreme value distributions for the maximum scores are approximated by Weibull distributions. From these distributions and their analytical forms, Z-scores, E-values, and p-values are derived to assess the significance of similarity scores. The framework also allows one to predict the value of standard chemical retrieval metrics, such as sensitivity and specificity at fixed thresholds or receiver operating characteristic (ROC) curves at multiple thresholds, and to detect outliers in the form of atypical molecules. Numerous and diverse experiments, performed in part with large sets of molecules from ChemDB, show remarkable agreement between theory and empirical results.
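
To make the quantities named in the abstract concrete, the following is a minimal sketch of the Tanimoto score for binary fingerprints together with an empirical Z-score and p-value computed against a background sample of scores. It does not implement the paper's analytical models (the ratio-of-Normals approximation or the Weibull extreme value fit); it only estimates significance by direct counting. The function names, the set-of-on-bits fingerprint representation, and the independent-bit random fingerprints used in the demo are all assumptions made for illustration.

```python
import random
import statistics


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto similarity: |A intersect B| / |A union B| for sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def significance(score: float, background: list[float]) -> tuple[float, float]:
    """Empirical Z-score and p-value of `score` against background scores.

    The p-value here is simply the fraction of background scores that are
    at least as large as `score` (an empirical estimate, not the analytical
    form derived in the paper).
    """
    mu = statistics.fmean(background)
    sigma = statistics.pstdev(background)
    z = (score - mu) / sigma if sigma > 0 else float("nan")
    p = sum(s >= score for s in background) / len(background)
    return z, p


def random_fingerprint(n_bits: int, p_on: float, rng: random.Random) -> set[int]:
    """Hypothetical background model: each bit is on independently with probability p_on."""
    return {i for i in range(n_bits) if rng.random() < p_on}


if __name__ == "__main__":
    rng = random.Random(0)
    query = random_fingerprint(1024, 0.05, rng)
    database = [random_fingerprint(1024, 0.05, rng) for _ in range(200)]
    scores = [tanimoto(query, fp) for fp in database]

    # Score of a molecule sharing most of the query's bits, judged against the database.
    near_copy = set(list(query)[: max(1, len(query) - 5)])
    z, p = significance(tanimoto(query, near_copy), scores)
    print(f"Z-score = {z:.2f}, empirical p-value = {p:.3f}")
```

An empirical p-value of this kind cannot resolve probabilities smaller than one over the background sample size, which is one motivation for fitting analytical distributions to the scores and their maxima as the abstract describes.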
