Greedy and Linear Ensembles of Machine Learning Methods Outperform Single Approaches for QSPR Regression Problems

The application of Machine Learning to cheminformatics is a large and active field of research, but few papers discuss whether ensembles of different Machine Learning methods can improve upon the performance of their component methodologies. Here we investigated a variety of methods, including kernel-based, tree-based, linear, and neural network approaches, together with both greedy and linear ensemble methods. All were tested against a standardised regression methodology using data relevant to the pharmaceutical development process, focusing on QSPR problems within drug-like chemical space. We aimed to determine which methods perform best and how the 'wisdom of crowds' principle can be applied to ensemble predictors. We found that no single method performs best for all problems, but that a dynamic, well-structured ensemble predictor performs very well across the board, usually improving on the best single method. Its use of weighting factors allows the greedy ensemble to draw a larger contribution from the better-performing models, which generally helps it outperform the simpler linear ensemble. The choice of data preprocessing methodology also proved crucial to the performance of each method.
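
To make the two ensemble schemes concrete, the sketch below illustrates them in Python with scikit-learn regressors. It is illustrative only: the base-learner pool, the random placeholder data standing in for molecular descriptors, and the selection budget are assumptions of this sketch and do not reproduce the authors' implementation. It shows how a linear ensemble fits combination weights directly on validation predictions, while a greedy ensemble builds its weights implicitly by repeatedly selecting whichever member most reduces the validation error, so better models accumulate larger weights.

```python
# Minimal sketch of linear vs. greedy ensembling for regression.
# All data and model settings here are placeholders, not those of the paper.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Placeholder descriptors X and target property y (e.g. a measured logP or solubility).
X, y = np.random.rand(500, 20), np.random.rand(500)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# A pool of diverse base learners: tree-based, kernel-based, linear, instance-based.
models = [RandomForestRegressor(random_state=0), SVR(), Ridge(), KNeighborsRegressor()]
preds = np.column_stack([m.fit(X_train, y_train).predict(X_val) for m in models])

# Linear ensemble: least-squares combination weights fitted on validation predictions.
w, *_ = np.linalg.lstsq(preds, y_val, rcond=None)
linear_pred = preds @ w

# Greedy ensemble: forward selection with replacement. A model chosen k times
# contributes with weight k, so better-performing models earn larger weights.
selected = []
for _ in range(20):
    best_j, best_err = None, np.inf
    for j in range(preds.shape[1]):
        trial = np.mean(preds[:, selected + [j]], axis=1)
        err = rmse(y_val, trial)
        if err < best_err:
            best_j, best_err = j, err
    selected.append(best_j)
greedy_pred = np.mean(preds[:, selected], axis=1)

print("linear ensemble RMSE:", rmse(y_val, linear_pred))
print("greedy ensemble RMSE:", rmse(y_val, greedy_pred))
```

In practice the combination weights and the greedy selection should be derived from a hold-out set or cross-validated predictions rather than the same validation data used for the final comparison, otherwise the ensemble weights overfit.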
