Automatic selection of molecular descriptors using random forest: Application to drug discovery

Highlights: A Random Forest-based approach improves the selection of molecular descriptors. Automatic feature selection improves the accuracy of drug discovery methods. Reducing complexity and time requirements allows larger datasets to be explored.

The optimal selection of chemical features (molecular descriptors) is an essential pre-processing step for the efficient application of computational intelligence techniques in virtual screening for the identification of bioactive molecules in drug discovery. The selection of molecular descriptors has a key influence on the accuracy of affinity prediction. To improve this prediction, we examined a Random Forest (RF)-based approach that automatically selects molecular descriptors from training data for ligands of kinases, nuclear hormone receptors, and other enzymes. Reducing the number of features used during prediction dramatically reduces computing time relative to existing approaches and consequently permits the exploration of much larger sets of experimental data. To test the validity of the method, we compared the results of our approach with those obtained using manual feature selection in our previous study (Perez-Sanchez, Cano, and Garcia-Rodriguez, 2014). The main novelty of this work in the field of drug discovery is the use of RF in two different ways: first for feature ranking and dimensionality reduction, and then for classification using the automatically selected feature subset. Our RF-based method outperforms the classification results obtained with Support Vector Machine (SVM) and Neural Network (NN) approaches.
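The two uses of RF described above (ranking descriptors, then classifying on the selected subset) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, a synthetic matrix in place of real molecular descriptors, impurity-based importances as the ranking criterion, and a hypothetical cutoff of `k = 10` descriptors. The abstract does not specify any of these choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a descriptor matrix (assumption):
# rows are ligands, columns are molecular descriptors, y is active/inactive.
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=5, n_redundant=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Use 1: fit an RF on all descriptors and rank them by importance.
ranker = RandomForestClassifier(n_estimators=200, random_state=0)
ranker.fit(X_train, y_train)
importances = ranker.feature_importances_

# Keep the k highest-ranked descriptors (k is a hypothetical cutoff).
k = 10
selected = np.argsort(importances)[::-1][:k]

# Use 2: train a second RF classifier on the reduced descriptor set only.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train[:, selected], y_train)
accuracy = clf.score(X_test[:, selected], y_test)

print("selected descriptor indices:", selected.tolist())
print("held-out accuracy on reduced set:", round(accuracy, 3))
```

The ranking is computed on the training split only, so the held-out ligands do not influence which descriptors are kept. In practice the cutoff would be tuned, for example by cross-validation.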

[1]  Ritesh Kumar,et al.  Discovery of new enzymes and metabolic pathways using structure and genome context , 2016 .

[2]  J. Irwin,et al.  Benchmarking sets for molecular docking. , 2006, Journal of medicinal chemistry.

[3]  L. Johnson,et al.  The structure of a glycogen phosphorylase glucopyranose spirohydantoin complex at 1.8 Å resolution and 100 K: The role of the water structure and its contribution to binding , 1998, Protein science : a publication of the Protein Society.

[4]  Enrico Blanzieri,et al.  A survey of learning-based techniques of email spam filtering , 2008, Artificial Intelligence Review.

[5]  R. Brenk,et al.  IspE Inhibitors Identified by a Combination of In Silico and In Vitro High-Throughput Screening , 2012, PloS one.

[6]  Yongsheng Ding,et al.  Using Chou's pseudo amino acid composition to predict subcellular localization of apoptosis proteins: An approach with immune genetic algorithm-based ensemble classifier , 2008, Pattern Recognit. Lett..

[7]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[8]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[9]  David J. Spiegelhalter,et al.  Machine Learning, Neural and Statistical Classification , 2009 .

[10]  Pedro J Ballester,et al.  Machine‐learning scoring functions to improve structure‐based binding affinity prediction and virtual screening , 2015, Wiley interdisciplinary reviews. Computational molecular science.

[11]  Dong-Sheng Cao,et al.  ChemoPy: freely available python package for computational biology and chemoinformatics , 2013, Bioinform..

[13]  Enrico Blanzieri,et al.  A survey of learning-based techniques of email spam filtering , 2008 .

[14]  Nir London,et al.  Covalent Docking of Large Libraries for the Discovery of Chemical Probes , 2014, Nature chemical biology.

[15]  Keinosuke Fukunaga,et al.  Introduction to Statistical Pattern Recognition , 1972 .

[16]  José García Rodríguez,et al.  Improving drug discovery using hybrid softcomputing methods , 2014, Appl. Soft Comput..

[17]  Frank Wien,et al.  Exploring the active site of herpes simplex virus type‐1 thymidine kinase by X‐ray crystallography of complexes with aciclovir and other ligands , 1998, Proteins.

[18]  Daniela M. Witten,et al.  An Introduction to Statistical Learning: with Applications in R , 2013 .

[19]  Masoud Nikravesh,et al.  Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing) , 2006 .

[20]  Huan Liu,et al.  Book review: Machine Learning, Neural and Statistical Classification Edited by D. Michie, D.J. Spiegelhalter and C.C. Taylor (Ellis Horwood Limited, 1994) , 1996, SGAR.

[21]  Xuedong Yan,et al.  Exploring precrash maneuvers using classification trees and random forests. , 2009, Accident; analysis and prevention.

[22]  Eric W. T. Ngai,et al.  Customer churn prediction using improved balanced random forests , 2009, Expert Syst. Appl..

[23]  G. Keserű,et al.  Integration of virtual and high throughput screening in lead discovery settings. , 2011, Combinatorial chemistry & high throughput screening.

[24]  Peng Xu,et al.  Random forests and the data sparseness problem in language modeling , 2007, Comput. Speech Lang..

[25]  Trevor Hastie,et al.  An Introduction to Statistical Learning , 2013, Springer Texts in Statistics.

[26]  Alex Acero,et al.  Spoken Language Processing: A Guide to Theory, Algorithm and System Development , 2001 .

[27]  Jürgen Bajorath,et al.  Integration of virtual and high-throughput screening , 2002, Nature Reviews Drug Discovery.

[28]  Gopal Garg,et al.  Bioinformatics: A Review , 2016 .

[29]  Dirk Van den Poel,et al.  Predicting customer retention and profitability by using random forests and regression forests techniques , 2005, Expert Syst. Appl..

[30]  Dik-Lung Ma,et al.  Molecular docking for virtual screening of natural product databases , 2011 .

[31]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[32]  Timothy M Willson,et al.  A Ligand-mediated Hydrogen Bond Network Required for the Activation of the Mineralocorticoid Receptor , 2005, Journal of Biological Chemistry.

[33]  Li Qiyue,et al.  Comparisons of Random Forest and Support Vector Machine for Predicting Blasting Vibration Characteristic Parameters , 2011 .

[34]  Kjeld Rasmussen.  Handbook of pattern recognition and image processing. Ed. by Tzay Y. Young, King-Sun Fu. , 1987 .

[35]  John B. O. Mitchell,et al.  A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking , 2010, Bioinform..

[36]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[37]  E. Berner,et al.  Clinical Decision Support Systems: Theory and Practice , 1998 .

[38]  Hong Yan,et al.  Pattern recognition techniques for the emerging field of bioinformatics: A review , 2005, Pattern Recognit..

[39]  Guan-Hua Du,et al.  Integration of virtual screening with high-throughput screening for the identification of novel Rho-kinase I inhibitors. , 2010, Journal of biotechnology.

[40]  Jens Meiler,et al.  Iterative experimental and virtual high-throughput screening identifies metabotropic glutamate receptor subtype 4 positive allosteric modulators , 2012, Journal of Molecular Modeling.

[41]  Seong-Whan Lee,et al.  Advances in Handwriting Recognition , 1999, Series in Machine Perception and Artificial Intelligence.

[42]  David L. Brautigan,et al.  Discovery and characterization of small molecules that target the GTPase Ral , 2014, Nature.

[43]  Keinosuke Fukunaga,et al.  Introduction to statistical pattern recognition (2nd ed.) , 1990 .

[44]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .