Reoptimization of MDL Keys for Use in Drug Discovery

For a number of years MDL products have exposed both 166 bit and 960 bit keysets based on 2D descriptors. These keysets were originally constructed and optimized for substructure searching. We report on improvements in the performance of MDL keysets which are reoptimized for use in molecular similarity. Classification performance for a test data set of 957 compounds was increased from 0.65 for the 166 bit keyset and 0.67 for the 960 bit keyset to 0.71 for a surprisal S/N pruned keyset containing 208 bits and 0.71 for a genetic algorithm optimized keyset containing 548 bits. We present an overview of the underlying technology supporting the definition of descriptors and the encoding of these descriptors into keysets. This technology allows definition of descriptors as combinations of atom properties, bond properties, and atomic neighborhoods at various topological separations as well as supporting a number of custom descriptors. These descriptors can then be used to set one or more bits in a keyset. We constructed various keysets and optimized their performance in clustering bioactive substances. Performance was measured using methodology developed by Briem and Lessel. "Directed pruning" was carried out by eliminating bits from the keysets on the basis of random selection, values of the surprisal of the bit, or values of the surprisal S/N ratio of the bit. The random pruning experiment highlighted the insensitivity of keyset performance for keyset lengths of more than 1000 bits. Contrary to initial expectations, pruning on the basis of the surprisal values of the various bits resulted in keysets which underperformed those resulting from random pruning. In contrast, pruning on the basis of the surprisal S/N ratio was found to yield keysets which performed better than those resulting from random pruning. We also explored the use of genetic algorithms in the selection of optimal keysets. Once more the performance was only a weak function of keyset size, and the optimizations failed to identify a single globally optimal keyset. Instead multiple, equally optimal keysets could be produced which had relatively low overlap of the descriptors they encoded.

[1]  J. Mason,et al.  Diversity assessment. , 1999, Current opinion in chemical biology.

[2]  Tudor I. Oprea,et al.  The Design of Leadlike Combinatorial Libraries. , 1999, Angewandte Chemie.

[3]  M Z Nagy,et al.  Substructure Search on Very Large Files Using Tree-Structured Databases , 1988 .

[4]  Thomas Henkel,et al.  Statistical Investigation into the Structural Complementarity of Natural Products and Synthetic Compounds. , 1999, Angewandte Chemie.

[5]  Malcolm J. McGregor,et al.  Clustering of Large Databases of Compounds: Using the MDL "Keys" as Structural Descriptors , 1997, J. Chem. Inf. Comput. Sci..

[6]  Hans Briem,et al.  Flexsim-X: A Method for the Detection of Molecules with Similar Biological Activity , 2000, J. Chem. Inf. Comput. Sci..

[7]  Jürgen Bajorath,et al.  Fingerprint Scaling Increases the Probability of Identifying Molecules with Similar Activity in Virtual Screening Calculations , 2001, J. Chem. Inf. Comput. Sci..

[8]  F. Lombardo,et al.  Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. , 2001, Advanced drug delivery reviews.

[9]  Erick K F Ahrens Customisation for Chemical Database Applications , 1988 .

[10]  James G. Nourse,et al.  Structure searching in chemical databases by direct lookup methods , 1993, J. Chem. Inf. Comput. Sci..

[11]  R. Levine,et al.  Molecular Reaction Dynamics and Chemical Reactivity , 1987 .

[12]  Robert D Clark,et al.  Neighborhood behavior: a useful concept for validation of "molecular diversity" descriptors. , 1996, Journal of medicinal chemistry.

[13]  Peter Willett,et al.  Bit-String Methods for Selective Compound Acquisition , 2000, J. Chem. Inf. Comput. Sci..

[14]  I. Kuntz,et al.  Molecular similarity based on DOCK-generated fingerprints. , 1996, Journal of medicinal chemistry.

[15]  Jürgen Bajorath,et al.  Mini-fingerprints Detect Similar Activity of Receptor Ligands Previously Recognized Only by Three-Dimensional Pharmacophore-Based Methods , 2001, J. Chem. Inf. Comput. Sci..

[16]  Peter Willett,et al.  Rapid Quantification of Molecular Diversity for Selective Database Acquisition , 1997, J. Chem. Inf. Comput. Sci..

[17]  Johnz Willett Similarity and Clustering in Chemical Information Systems , 1987 .

[18]  David M. Rocke,et al.  Predicting ligand binding to proteins by affinity fingerprinting. , 1995, Chemistry & biology.

[19]  Matthias Rarey,et al.  Feature trees: A new molecular similarity measure based on tree matching , 1998, J. Comput. Aided Mol. Des..

[20]  Tudor I. Oprea,et al.  Is There a Difference between Leads and Drugs? A Historical Perspective , 2001, J. Chem. Inf. Comput. Sci..

[21]  David Weininger,et al.  Stigmata: An Algorithm To Determine Structural Commonalities in Diverse Datasets , 1996, J. Chem. Inf. Comput. Sci..

[22]  H. Matter,et al.  Selecting optimally diverse compounds from structure databases: a validation study of two-dimensional and three-dimensional molecular descriptors. , 1997, Journal of medicinal chemistry.

[23]  Y. Martin,et al.  Designing combinatorial library mixtures using a genetic algorithm. , 1997, Journal of medicinal chemistry.

[24]  U. Lessel,et al.  In vitro and in silico affinity fingerprints: Finding similarities beyond structural classes , 2000 .

[25]  G. S. Johnson,et al.  An Information-Intensive Approach to the Molecular Pharmacology of Cancer , 1997, Science.

[26]  James G. Nourse,et al.  The substance module: the representation, storage, and searching of complex structures , 1991, J. Chem. Inf. Comput. Sci..