Filtered circular fingerprints improve either prediction or runtime performance while retaining interpretability

Abstract

Background: Even though circular fingerprints were first introduced more than 50 years ago, they are still widely used for building highly predictive, state-of-the-art (Q)SAR models. Historically, these structural fragments were designed to search large molecular databases. Hence, to derive a compact representation, circular fingerprint fragments are often folded to comparatively short bit-strings. However, folding fingerprints introduces bit collisions, and therefore adds noise to the encoded structural information and removes its interpretability. Both representations, folded as well as unprocessed fingerprints, are often used for (Q)SAR modeling.

Results: We show that it can be preferable to build (Q)SAR models with circular fingerprint fragments that have been filtered by supervised feature selection, instead of using folded fingerprints or all fragments. Compared to folded fingerprints, filtered fingerprints significantly increase predictive performance and remain unambiguous and interpretable. Compared to unprocessed fingerprints, filtered fingerprints reduce the computational effort and are a more compact and less redundant feature representation. Depending on the selected learning algorithm, filtering yields about equally predictive (Q)SAR models. We demonstrate the suitability of filtered fingerprints for (Q)SAR modeling by presenting our freely available web service Collision-free Filtered Circular Fingerprints, which provides rationales for predictions by highlighting important structural features in the query compound (see http://coffer.informatik.uni-mainz.de).

Conclusions: Circular fingerprints are potent structural features that yield highly predictive models and encode interpretable structural information. However, to preserve interpretability, circular fingerprints should not be folded when building prediction models. Our experiments show that filtering is a suitable option to reduce the high computational effort when working with all fingerprint fragments. Additionally, our experiments suggest that the area under the precision-recall curve is a more sensible statistic for validating (Q)SAR models for virtual screening than the area under the ROC curve or other measures for early recognition.
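The difference between folding and filtering can be illustrated with a small sketch. It is not the authors' implementation: it assumes that fragments are identified by integer hashes, that folding maps a fragment to a bit position by taking the hash modulo the bit-string length, and that supervised filtering ranks fragments by a chi-square statistic against the activity label. The function names (`fold`, `collisions`, `filter_fragments`) and the toy data are hypothetical.

```python
from collections import defaultdict


def fold(fragment_ids, n_bits):
    """Map each fragment id to a bit position via modulo (assumed folding scheme)."""
    buckets = defaultdict(set)
    for frag in fragment_ids:
        buckets[frag % n_bits].add(frag)
    return dict(buckets)


def collisions(buckets):
    """Return the bit positions that are shared by more than one distinct fragment."""
    return {bit: frags for bit, frags in buckets.items() if len(frags) > 1}


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table:
    a = present/active, b = present/inactive, c = absent/active, d = absent/inactive."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def filter_fragments(compounds, labels, top_k):
    """Keep the top_k fragments with the highest chi-square score (assumed criterion)."""
    n_active = sum(labels)
    n_inactive = len(labels) - n_active
    counts = defaultdict(lambda: [0, 0])  # fragment -> [count in actives, count in inactives]
    for frags, label in zip(compounds, labels):
        for frag in frags:
            counts[frag][0 if label else 1] += 1
    scores = {
        frag: chi_square(pa, pi, n_active - pa, n_inactive - pi)
        for frag, (pa, pi) in counts.items()
    }
    # Sort by descending score; break ties by fragment id for a deterministic result.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return ranked[:top_k]


# Toy data: each compound is a set of fragment ids, labels mark actives (1) and inactives (0).
compounds = [{11, 27, 1035}, {11, 1035}, {27, 42}, {42, 11}]
labels = [1, 1, 0, 0]

all_fragments = set().union(*compounds)
folded = fold(all_fragments, n_bits=1024)
print("colliding bits:", collisions(folded))
print("selected fragments:", filter_fragments(compounds, labels, top_k=2))
```

In this toy example, fragments 11 and 1035 fall onto the same bit after folding, so a model can no longer tell which of the two substructures a set bit stands for. After filtering, every retained feature still corresponds to exactly one fragment, which is what keeps the representation interpretable.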
