The influence of the inactives subset generation on the performance of machine learning methods

Background

Machine learning methods have grown increasingly popular in virtual screening, in both classification and regression tasks, over the past few years. However, their effectiveness depends strongly on many different factors.

Results

In this study, the influence of the way the set of inactives is formed on the classification process was examined: random and diverse selection from the ZINC database, from the MDDR database, and from libraries generated according to the DUD methodology. All learning methods were tested in two modes: using one test set, the same for each method of inactive-molecule generation, and using test sets whose inactives were prepared in the same way as for training. The experiments were carried out for 5 different protein targets, 3 fingerprints for molecule representation, and 7 classification algorithms with varying parameters. The process of inactive-set formation turned out to have a substantial impact on the performance of the machine learning methods.

Conclusions

The degree to which chemical space was limited determined the ability of the tested classifiers to select potentially active molecules in virtual screening tasks; for example, DUD sets (widely applied in docking experiments) did not provide proper selection of active molecules from databases with diverse structures. The study clearly showed that the inactive compounds forming the training set should be as representative as possible of the libraries that undergo screening.
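The two modes of inactive-set generation described above (random selection and diversity-based selection from a large candidate pool) can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes RDKit and scikit-learn, uses placeholder SMILES in place of the ZINC/MDDR/DUD libraries and the real actives, and the active-to-inactive ratio, fingerprint settings, and classifier parameters are arbitrary choices, not the study's protocol.

```python
import random

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker
from sklearn.ensemble import RandomForestClassifier


def morgan_fps(smiles_list, radius=2, n_bits=1024):
    """Morgan (ECFP-like) bit-vector fingerprints for a list of SMILES strings."""
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    return [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits)
            for m in mols if m is not None]


def pick_inactives(candidate_fps, n_pick, diverse=False, seed=7):
    """Select putative inactives either at random or with a MaxMin diversity pick."""
    if diverse:
        picker = MaxMinPicker()
        idx = picker.LazyBitVectorPick(candidate_fps, len(candidate_fps), n_pick, seed=seed)
        return [candidate_fps[i] for i in idx]
    return random.Random(seed).sample(candidate_fps, n_pick)


def to_matrix(fps):
    """Convert RDKit bit vectors to a dense numpy matrix for scikit-learn."""
    return np.array([[int(b) for b in fp.ToBitString()] for fp in fps], dtype=np.int8)


# Toy data standing in for the known actives and a large candidate pool
# (in the study: ZINC, MDDR, or DUD-style decoy libraries).
actives = morgan_fps(["CCO", "CCN", "CCOC", "CCNC"])
pool = morgan_fps(["c1ccccc1", "c1ccccc1O", "CC(=O)O", "CCCCCC", "CC(C)C",
                   "C1CCCCC1", "CCS", "CC(=O)N", "O=C(O)c1ccccc1", "CCCl",
                   "CCBr", "CCCO"])

for diverse in (False, True):
    inactives = pick_inactives(pool, n_pick=2 * len(actives), diverse=diverse)
    X = np.vstack([to_matrix(actives), to_matrix(inactives)])
    y = np.array([1] * len(actives) + [0] * len(inactives))
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    # In the study, the fitted model is then evaluated either on one shared test
    # set or on a test set whose inactives were generated the same way as here.
    print("diverse =", diverse, "training accuracy =", clf.score(X, y))
```

The same selection functions would be applied per protein target, and the classifier and fingerprint types swapped out, to reproduce the kind of comparison the abstract describes.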
