Introduction of a Generally Applicable Method to Estimate Retrieval of Active Molecules for Similarity Searching using Fingerprints

Fingerprints are bit string representations of molecular structure and properties and are among the most widely used computational tools for similarity searching and database screening. Various fingerprint designs are available, and their search performance generally depends strongly on the compound classes under study and the chemical characteristics of the screening database. Currently, it is not possible to predict the probability of identifying novel hits through fingerprint searching. For practical applications, however, such estimates would be very useful: one could, for example, prioritize fingerprints and compound selection strategies, or decide whether a similarity search campaign with subsequent experimental evaluation of candidate compounds is promising at all. We have developed a method that makes it possible to predict the outcome of similarity search calculations using any type of keyed fingerprint. The methodology incorporates the bit frequency distributions of the reference molecules and of the screening database into an information-theoretic function and determines the recall of active compounds that is possible in principle within selection sets of varying size. We calibrate the function on diverse compound classes and accurately predict compound recovery in retrospective virtual screening trials. Furthermore, we correctly predict fingerprint search performance on two experimental high-throughput screening (HTS) data sets. Our findings indicate that, given a set of reference molecules, a fingerprint, and a screening database, we can readily estimate how likely it is that active compounds will be retrieved, without knowledge of the distribution of potential hits in the database.
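
The abstract does not state the form of the information-theoretic function. As a minimal sketch of the general idea only, the following example compares per-bit frequencies of reference molecules with those of a screening database using a summed Kullback-Leibler divergence over independent bits. The choice of divergence, the pseudocount smoothing, and all function names are assumptions made for illustration, not the published method.

```python
import math


def bit_frequencies(fingerprints, pseudocount=1.0):
    """Relative frequency of each bit being set, with additive smoothing.

    `fingerprints` is a list of equal-length lists of 0/1 values.
    The pseudocount (an assumption of this sketch) keeps every
    frequency strictly between 0 and 1 so the logarithms are defined.
    """
    n = len(fingerprints)
    n_bits = len(fingerprints[0])
    return [
        (sum(fp[i] for fp in fingerprints) + pseudocount) / (n + 2 * pseudocount)
        for i in range(n_bits)
    ]


def bernoulli_kl(p, q):
    """KL divergence in bits between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))


def fingerprint_divergence(reference_fps, database_fps):
    """Sum of per-bit KL divergences, treating bits as independent.

    Larger values mean the reference molecules set bits with frequencies
    that differ more from the database background.
    """
    p = bit_frequencies(reference_fps)
    q = bit_frequencies(database_fps)
    return sum(bernoulli_kl(pi, qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Toy 4-bit keyed fingerprints (hypothetical data).
    reference = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0]]
    database = [[0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
    print(fingerprint_divergence(reference, database))
```

In this sketch a larger divergence indicates that the reference set is more clearly distinguished from the database background; relating such a value to expected recall would require a calibration step of the kind the abstract describes, which is not reproduced here.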
