Validated Descriptors for Diversity Measurements and Optimization

The new strategies of high-throughput screening, combinatorial chemistry, and structure-activity relationships by NMR demand that computational chemists be able to select, from 10^5 to 10^7 compounds, a diverse subset for purchase, synthesis, or testing. This must be accomplished within a reasonable time and with measurable accuracy. With these constraints in mind, we compared a number of molecular descriptors and clustering methods using diverse sets of molecules with known biological activity. We found that bit strings that describe the presence or absence of 153 small generic and specific fragments outperform hashed fingerprints that include the nature of all substructures containing up to seven bonds. They also outperform distance keys used in commercial three-dimensional searching systems. These 153 descriptors contain the most information about hydrophobicity, pKa, size, shape, and hydrogen bonding properties of the molecules. Even better performance is obtained when distances between site points complementary to hydrogen bonding and charged groups are combined with distances between centres of aromatic rings and attachment points for hydrophobic groups. Of the clustering methods considered, Ward's method is most effective at separating active from inactive compounds.
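As an illustration of clustering fragment bit strings with Ward's method, the following is a minimal sketch and not the implementation used in this work. It assumes squared Euclidean distance between bit vectors, a naive search over all cluster pairs that suits only small sets, and a hypothetical toy data set (`toy_keys`) of six 8-bit keys standing in for the 153-bit descriptors. The function names are likewise illustrative.

```python
from itertools import combinations


def centroid(vectors):
    """Mean vector of a list of equal-length bit strings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_cluster(keys, n_clusters):
    """Agglomerate molecules (rows of keys) until n_clusters remain.

    Returns a sorted list of clusters, each a sorted list of row indices.
    """
    clusters = [[i] for i in range(len(keys))]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters whose merge is cheapest under Ward's criterion.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: ward_cost(
                [keys[k] for k in clusters[p[0]]],
                [keys[k] for k in clusters[p[1]]],
            ),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy data: each row is one molecule, each column one fragment bit.
toy_keys = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]

print(ward_cluster(toy_keys, 2))  # [[0, 1, 2], [3, 4, 5]]
```

For a library of 10^5 to 10^7 compounds this pairwise search would be far too slow. A production setting would need an optimized hierarchical clustering routine.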