Bayesian Similarity Searching in High-Dimensional Descriptor Spaces Combined with Kullback-Leibler Descriptor Divergence Analysis

We investigate an approach that combines Bayesian modeling of the probability distributions of descriptor values for active and database molecules with Kullback-Leibler analysis of the divergence between these distributions. The methodology is used for Bayesian screening and also to predict compound recall rates. In our study, we analyze two fundamental approximations underlying the Bayesian screening approach: the assumption that descriptors are independent of each other and the assumption that their data set values follow normal distributions. In addition, we calculate Kullback-Leibler divergence for single descriptors, rather than for multiple-feature distributions, in order to prioritize descriptors for screening calculations. The results show that descriptor correlation effects, which violate the assumption of feature independence, can lead to a notable reduction of compound recall in Bayesian screening. Controlling descriptor correlation effects plays a much more significant role in achieving high recall rates than approximating descriptor distributions by Gaussians. Furthermore, Kullback-Leibler divergence analysis is shown to systematically identify the descriptors that are most relevant for the outcome of Bayesian screening calculations.
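The two ingredients named in the abstract can be illustrated with a minimal sketch. It assumes each descriptor is modeled by a univariate Gaussian, summarized as a (mean, standard deviation) pair, and that descriptors are treated as independent. The function names, the divergence direction (actives relative to database), and the toy parameters are illustrative assumptions and are not taken from the paper.

```python
import math

def gaussian_kl(mu_p, sd_p, mu_q, sd_q):
    """KL divergence D(P || Q), in nats, between two univariate Gaussians."""
    return (math.log(sd_q / sd_p)
            + (sd_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sd_q ** 2)
            - 0.5)

def gaussian_log_pdf(x, mu, sd):
    """Log density of a univariate Gaussian at x."""
    return -0.5 * math.log(2.0 * math.pi * sd ** 2) - (x - mu) ** 2 / (2.0 * sd ** 2)

def log_odds(values, active, database):
    """Log-likelihood ratio of 'active' versus 'database' for one compound.

    Summing per-descriptor terms is exactly the independence assumption.
    """
    return sum(gaussian_log_pdf(x, *a) - gaussian_log_pdf(x, *d)
               for x, a, d in zip(values, active, database))

def rank_descriptors(active, database):
    """Descriptor indices sorted by decreasing single-descriptor KL divergence."""
    kl = [gaussian_kl(*a, *d) for a, d in zip(active, database)]
    return sorted(range(len(kl)), key=lambda i: kl[i], reverse=True)

# Hypothetical (mean, sd) pairs for three descriptors; not data from the study.
active_params = [(1.0, 1.0), (0.0, 1.0), (3.0, 0.5)]
database_params = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]

ranking = rank_descriptors(active_params, database_params)
score = log_odds([1.0, 0.0, 3.0], active_params, database_params)
```

In this toy setup the second descriptor has identical distributions in both sets, so its divergence is zero and it contributes nothing to the score, while the third descriptor, whose active distribution is both shifted and narrower, ranks first.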
