In silico classification of human maximum recommended daily dose based on modified random forest and substructure fingerprint.

A modified random forest (RF) algorithm, a novel machine learning technique, was developed to estimate the maximum recommended daily dose (MRDD) of a large and diverse set of pharmaceuticals for phase I human trials, using substructure fingerprint descriptors calculated from molecular structure alone. These descriptors encode a molecule as a series of binary bits, each representing the presence or absence of a particular substructure, and thus directly capture local structural information in the molecule. Two validation approaches, 5-fold cross-validation and an independent validation set, were used to assess the predictive capability of the models. The modified RF achieved an accuracy of 80.45%, a sensitivity of 75.08%, and a specificity of 84.85% in 5-fold cross-validation, and an accuracy of 80.5%, a sensitivity of 76.47%, and a specificity of 83.48% on the independent validation set; taken as a whole, these results are better than those of the original RF. In addition, the important substructure fingerprints identified by the RF gave some insight into the structural features related to the toxicity of pharmaceuticals, which could provide an intuitive understanding for medicinal chemists.
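
For readers less familiar with the three reported measures, the following is a minimal sketch of how accuracy, sensitivity, and specificity are conventionally computed from binary class labels. It is not the authors' code: the function name `classification_metrics` and the toy labels are hypothetical, and treating label 1 as the "positive" class is an assumption, since the abstract does not state which MRDD class is taken as positive.

```python
import math
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity, and specificity for binary labels.

    Assumption: label 1 is the positive class and label 0 the negative class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Sensitivity is undefined without positives, specificity without negatives.
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    specificity = tn / (tn + fp) if (tn + fp) > 0 else math.nan
    accuracy = (tp + tn) / len(y_true)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the study.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    for name, value in classification_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.4f}")
```

Under 5-fold cross-validation, measures of this kind would typically be computed on the held-out fold predictions, whereas for the independent validation set they are computed once on compounds not used in training.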
