Mathematical Correction for Fingerprint Similarity Measures to Improve Chemical Retrieval

In many modern chemoinformatics systems, molecules are represented by long binary fingerprint vectors recording the presence or absence of particular features or substructures, such as labeled paths or trees, in the molecular graphs. These long fingerprints are often compressed to much shorter fingerprints using a simple modulo operation. As the length of the fingerprints decreases, their typical density and overlap tend to increase, and so does any similarity measure based on overlap, such as the widely used Tanimoto similarity. Here we show that this correlation between shorter fingerprints and higher similarity can be thought of as a systematic error introduced by the fingerprint folding algorithm and that this systematic error can be corrected mathematically. More precisely, given two molecules and their compressed fingerprints of a given length, we show how a better estimate of their uncompressed overlap, hence of their similarity, can be derived to correct for this bias. We show how the correction can be implemented not only for the Tanimoto measure but also for all other commonly used measures. Experiments on various data sets and fingerprint sizes demonstrate how, with a negligible computational overhead, the correction noticeably improves the sensitivity and specificity of chemical retrieval.
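The folding bias described above can be illustrated with a minimal sketch. The code below is not the paper's derivation. It assumes that feature indices behave like uniformly random integers, so that a set of k features folded to length n sets on average n(1 - (1 - 1/n)^k) bits, and it inverts that relation to estimate uncompressed counts. All function names (`fold`, `tanimoto`, `estimate_unfolded_count`, `corrected_tanimoto`, `demo`) and the synthetic data are hypothetical and chosen only for illustration.

```python
import math
import random


def fold(on_bits, n):
    """Fold a sparse fingerprint (set of on-bit indices) to length n via modulo."""
    return {i % n for i in on_bits}


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def estimate_unfolded_count(folded_count, n):
    """Estimate how many distinct features produced `folded_count` on-bits.

    Assumption (not from the source): features hash uniformly, so
    E[folded_count] = n * (1 - (1 - 1/n) ** k); this solves that for k.
    """
    if folded_count >= n:
        raise ValueError("folded fingerprint is saturated; cannot invert")
    return math.log(1 - folded_count / n) / math.log(1 - 1 / n)


def corrected_tanimoto(fa, fb, n):
    """Estimate the uncompressed Tanimoto similarity from folded fingerprints.

    Folding commutes with union, so the folded union is itself a valid folded
    fingerprint; the intersection is recovered by inclusion-exclusion.
    """
    a = estimate_unfolded_count(len(fa), n)
    b = estimate_unfolded_count(len(fb), n)
    u = estimate_unfolded_count(len(fa | fb), n)
    if u <= 0:
        return 1.0
    return max(0.0, a + b - u) / u


def demo(seed=0, n=1024):
    """Build two synthetic fingerprints with true Tanimoto 0.3 and fold them."""
    rng = random.Random(seed)
    feats = rng.sample(range(2 ** 20), 1000)  # 1000 distinct feature indices
    shared, only_a, only_b = feats[:300], feats[300:650], feats[650:]
    A, B = set(shared + only_a), set(shared + only_b)
    fa, fb = fold(A, n), fold(B, n)
    return tanimoto(A, B), tanimoto(fa, fb), corrected_tanimoto(fa, fb, n)


if __name__ == "__main__":
    true_sim, folded_sim, corrected_sim = demo()
    print(f"true={true_sim:.3f} folded={folded_sim:.3f} corrected={corrected_sim:.3f}")
```

In this synthetic setting the folded similarity should come out well above the true value of 0.3, while the corrected estimate should land much closer to it. Real fingerprints are not uniformly random, so the estimator here is only a rough stand-in for the kind of correction the abstract describes.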
