Chemical Database Mining through Entropy-Based Molecular Similarity Assessment of Randomly Generated Structural Fragment Populations

We describe a novel approach to searching for active compounds that is based on the generation of random molecular fragment populations. As a similarity-based methodology, fragment profiling depends neither on predefined descriptors of molecular structure and properties nor on the design of chemical space representations. To adapt the generation and comparison of random fragment populations for large-scale compound screening, we compare different fragmentation schemes, introduce the concept of compound class-specific fragment frequencies, and develop a novel entropic similarity metric for compound ranking. The approach was tested extensively on 15 compound activity classes with varying degrees of intraclass structural diversity and produced promising results, comparable to those of similarity searching using fingerprints. A key feature of fragment profile searching is that calculating the compound class-specific proportional Shannon entropy of random fragment distributions enables the identification of database molecules that share a significant number of signature substructures with known active compounds.
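
As a rough illustration of the quantity underlying the entropic metric, the sketch below computes the ordinary Shannon entropy, H = -sum(p_i * log2(p_i)), of a fragment frequency distribution. It is not the class-specific proportional entropy of this work, whose exact definition is not given in the abstract. The function name `shannon_entropy` and the fragment strings are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(fragment_counts):
    """Shannon entropy (in bits) of a fragment frequency distribution.

    fragment_counts maps a fragment identifier to the number of times
    that fragment was observed. This is the textbook definition, used
    here as a stand-in; the proportional, class-specific variant
    described in the text is NOT reproduced.
    """
    total = sum(fragment_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in fragment_counts.values():
        if count > 0:
            p = count / total  # relative frequency of this fragment
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical fragment population; the strings are placeholders and
# were not produced by any fragmentation scheme from the paper.
fragments = ["c1ccccc1", "C(=O)N", "c1ccccc1", "CO"]
counts = Counter(fragments)
print(shannon_entropy(counts))
```

A population dominated by a single fragment yields an entropy near zero, while a population spread evenly over many distinct fragments yields a high value. This is why entropy is a natural summary of how a fragment distribution is shaped.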