Large scale study of multiple-molecule queries

Background

In ligand-based screening, as well as in other chemoinformatics applications, one seeks to search large repositories of molecules effectively in order to retrieve molecules that are similar, typically, to a single lead molecule. However, in some cases, multiple molecules from the same family are available to seed the query and search for other members of the same family.

Multiple-molecule query methods have been less studied than single-molecule query methods. Furthermore, previous studies have relied on proprietary data and sometimes have not used proper cross-validation methods to assess the results. In contrast, here we develop and compare multiple-molecule query methods using several large, publicly available data sets and backgrounds. We also create a framework based on a strict cross-validation protocol to allow unbiased benchmarking and direct comparison in future studies across several performance metrics.

Results

Fourteen different multiple-molecule query methods were defined and benchmarked using: (1) 41 publicly available data sets of related molecules with similar biological activity; and (2) publicly available background data sets consisting of up to 175,000 molecules randomly extracted from the ChemDB database and other sources. Eight of the fourteen methods were parameter free, and the other six fit one or two free parameters to the data using a careful cross-validation protocol. All the methods were assessed and compared for their ability to retrieve members of the same family against the background data set, using several performance metrics: the Area Under the Accumulation Curve (AUAC), Area Under the Curve (AUC), F1-measure, and BEDROC.

Consistent with the previous literature, the best parameter-free methods are MAX-SIM and MIN-RANK, which score a molecule against a family by the maximum similarity, or minimum rank, that it obtains across the members of the family. Two new parameterized methods introduced in this study, the Exponential Tanimoto Discriminant (ETD) and the Tanimoto Power Discriminant (TPD), and one previously defined method, the Binary Kernel Discriminant (BKD), outperform most other methods but are more complex, requiring one or two parameters to be fit to the data.

Conclusion

Fourteen methods for multiple-molecule querying of chemical databases, including the novel ETD and TPD methods, are validated using publicly available data sets, standard cross-validation protocols, and established metrics. The best results are obtained with ETD, TPD, BKD, MAX-SIM, and MIN-RANK. These results can be replicated and compared with the results of future studies using data freely downloadable from http://cdb.ics.uci.edu/.
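
As an illustration of the two parameter-free scores named in the Results, the following is a minimal Python sketch, not the study's implementation. It assumes each fingerprint is represented as a set of on-bit indices and that similarity is the Tanimoto coefficient; the function names (`tanimoto`, `max_sim`, `min_rank`), the toy fingerprints, and the tie-breaking rule are all hypothetical choices made for this example.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_sim(candidate, family):
    """MAX-SIM: highest similarity of the candidate to any query molecule."""
    return max(tanimoto(candidate, query) for query in family)


def min_rank(candidates, family):
    """MIN-RANK: best (smallest) rank each candidate reaches in any
    single-query ranking. Rank 1 is the most similar candidate.
    Assumption: ties are broken by position in `candidates`."""
    best = [len(candidates)] * len(candidates)
    for query in family:
        sims = [tanimoto(c, query) for c in candidates]
        # sorted() is stable, so equal similarities keep their input order
        order = sorted(range(len(candidates)), key=lambda i: -sims[i])
        for rank, i in enumerate(order, start=1):
            best[i] = min(best[i], rank)
    return best


if __name__ == "__main__":
    # Toy data: two query molecules and three database molecules (hypothetical)
    family = [{1, 2, 3, 4}, {3, 4, 5, 6}]
    candidates = [{1, 2, 3, 4, 7}, {5, 6, 8}, {9, 10}]
    print([max_sim(c, family) for c in candidates])
    print(min_rank(candidates, family))
```

A database would then be sorted by decreasing MAX-SIM score or by increasing MIN-RANK score. The parameterized methods (ETD, TPD, BKD) are not sketched here because the abstract does not give their functional forms.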
