Bioactive Molecule Prediction Using Extreme Gradient Boosting

Following the explosive growth in chemical and biological data, drug discovery has shifted from traditional methods to computer-aided ones, making data mining and machine learning integral parts of today’s drug discovery process. In this paper, extreme gradient boosting (Xgboost), an ensemble of Classification and Regression Trees (CART) and a variant of the Gradient Boosting Machine, is investigated for predicting biological activity from a quantitative description of a compound’s molecular structure. Seven datasets that are well known in the literature were used, and the experimental results show that Xgboost can outperform machine learning algorithms such as Random Forest (RF), Support Vector Machines (LSVM), Radial Basis Function Neural Network (RBFN) and Naïve Bayes (NB) in predicting biological activity. In addition to detecting minority activity classes in highly imbalanced datasets, it performed remarkably well on both high- and low-diversity datasets.
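To make the idea of a boosted tree ensemble concrete, the following is a minimal sketch in plain Python, not the implementation used in this paper. It assumes binary fingerprint bits as the molecular descriptors, depth-one trees (stumps) as the base learners, a logistic loss, and the second-order leaf weights with L2 regularization described for XGBoost. All names (`fit_stump`, `fit_boosted_stumps`, `predict_proba`) and the toy data, in which a single hypothetical substructure bit determines activity, are illustrative assumptions.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, grad, hess, lam):
    """Choose the fingerprint bit whose split gives the largest second-order gain."""
    def score(g, h):
        return g * g / (h + lam)
    G, H = sum(grad), sum(hess)
    best = None
    for j in range(len(X[0])):
        # Gradient and Hessian sums for molecules that have bit j set.
        g1 = sum(g for x, g in zip(X, grad) if x[j] == 1)
        h1 = sum(h for x, h in zip(X, hess) if x[j] == 1)
        g0, h0 = G - g1, H - h1
        gain = score(g0, h0) + score(g1, h1) - score(G, H)
        if best is None or gain > best[0]:
            # Regularized Newton step for each of the two leaves.
            best = (gain, j, -g0 / (h0 + lam), -g1 / (h1 + lam))
    _, j, w0, w1 = best
    return j, w0, w1

def fit_boosted_stumps(X, y, n_rounds=20, eta=0.3, lam=1.0):
    """Additively fit stumps to the gradients of the logistic loss."""
    margin = [0.0] * len(X)
    model = []
    for _ in range(n_rounds):
        p = [sigmoid(m) for m in margin]
        grad = [pi - yi for pi, yi in zip(p, y)]      # first derivative of the loss
        hess = [pi * (1.0 - pi) for pi in p]          # second derivative of the loss
        j, w0, w1 = fit_stump(X, grad, hess, lam)
        model.append((j, eta * w0, eta * w1))         # shrink each stump by eta
        margin = [m + (eta * w1 if x[j] else eta * w0)
                  for m, x in zip(margin, X)]
    return model

def predict_proba(model, x):
    """Predicted probability that fingerprint x belongs to the active class."""
    return sigmoid(sum(w1 if x[j] else w0 for j, w0, w1 in model))

# Toy data (an assumption for illustration): every 4-bit fingerprint,
# labelled active exactly when bit 0 is set.
X = [list(bits) for bits in product([0, 1], repeat=4)]
y = [x[0] for x in X]
model = fit_boosted_stumps(X, y)
```

Calling `predict_proba(model, [1, 0, 0, 0])` then returns a probability above 0.5 and `predict_proba(model, [0, 1, 1, 1])` one below 0.5. In practice one would use the XGBoost library itself, which grows deeper trees and adds further regularization and performance optimizations that this sketch omits.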
