Toward an Improved Clustering of Large Data Sets Using Maximum Common Substructures and Topological Fingerprints

A new clustering algorithm was developed that groups large data sets of more than 100,000 molecules according to their chemotypes. The algorithm preclusters a data set using a fingerprint version of the hierarchical k-means algorithm. Chemotypes are then extracted from the terminal clusters via a maximum common substructure approach. Molecules forming a chemotype have to share a predefined number of rings, atoms, and non-carbon heavy atoms. In an iterative procedure, similar chemotypes and singletons are merged into larger chemotypes. Singletons that cannot be assigned to any chemotype are then grouped based on the proportion of overlap between the molecules. Representatives from each chemotype and the singletons are used in a second round of the hierarchical k-means algorithm to provide a final hierarchical grouping. Results are presented in an interactive graphical user interface that gives initial insights into the structure-activity relationship (SAR) of the molecules. Example applications are shown for two chemotypes of reverse transcriptase inhibitors in the MDDR database and for the evaluation of descriptor-based similarity searching routines, with particular attention to the chemotype hopping potential of each individual routine. The algorithm is intended to improve the quality of the analysis of high-throughput and virtual screening results.
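The preclustering step can be illustrated with a minimal sketch of hierarchical k-means on binary fingerprints. The abstract does not specify the fingerprint type, the value of k, the seeding scheme, or the stopping criterion, so everything below is an assumption made for illustration: fingerprints are represented as sets of on-bit indices, each split uses k = 2, seeds are chosen deterministically, similarity to a centroid is the continuous Tanimoto coefficient, and recursion stops at a hypothetical `max_size` threshold. It is not the authors' implementation.

```python
from collections import Counter


def centroid(fps):
    """Mean bit frequency over a list of fingerprints (sets of on-bit indices)."""
    counts = Counter(bit for fp in fps for bit in fp)
    return {bit: n / len(fps) for bit, n in counts.items()}


def tanimoto(fp, cen):
    """Continuous Tanimoto between a binary fingerprint and a real-valued centroid."""
    dot = sum(cen.get(bit, 0.0) for bit in fp)
    denom = len(fp) + sum(v * v for v in cen.values()) - dot
    return dot / denom if denom else 1.0


def two_means(fps, max_iter=20):
    """Split fps into two groups with k-means (k = 2); returns two index lists."""
    # Assumed deterministic seeding: the first fingerprint and the one
    # least similar to it.
    cen_a = centroid([fps[0]])
    seed_b = min(range(len(fps)), key=lambda i: tanimoto(fps[i], cen_a))
    cens = [cen_a, centroid([fps[seed_b]])]
    assign = None
    for _ in range(max_iter):
        new = [0 if tanimoto(fp, cens[0]) >= tanimoto(fp, cens[1]) else 1
               for fp in fps]
        if new == assign:  # converged: no assignment changed
            break
        assign = new
        for k in (0, 1):
            members = [fp for fp, a in zip(fps, assign) if a == k]
            if members:
                cens[k] = centroid(members)
    return [[i for i, a in enumerate(assign) if a == k] for k in (0, 1)]


def hierarchical_kmeans(fps, max_size=2):
    """Recursively bisect until each terminal cluster has at most max_size members."""
    def recurse(indices):
        if len(indices) <= max_size:
            return [indices]
        left, right = two_means([fps[i] for i in indices])
        if not left or not right:  # cannot be split further (e.g. identical fingerprints)
            return [indices]
        return (recurse([indices[i] for i in left])
                + recurse([indices[i] for i in right]))
    return recurse(list(range(len(fps))))


# Toy example: two pairs of fingerprints with disjoint bit patterns.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {10, 11, 12}, {10, 11, 13}]
terminal_clusters = hierarchical_kmeans(toy_fps, max_size=2)
print(terminal_clusters)
```

In the workflow described above, the terminal clusters returned by such a routine would be the input to the maximum common substructure step; that step and the chemotype thresholds are not sketched here because the abstract does not give their values.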