Evaluation of predictive models based on random forest, decision tree and support vector machine classifiers and virtual screening of anti-mycobacterial compounds

Three machine learning classifiers (random forest (RF), decision tree and support vector machine) were used to build predictive models from an anti-mycobacterial ChEMBL dataset, and the models were evaluated for their predictive capability. Before model development, the data were pre-processed to address class imbalance by applying a cost-sensitive classifier and by filtering data instances with the supervised synthetic minority oversampling technique (SMOTE), spread subsample and resample methods. Statistical evaluation indicated that the random forest model performed best, with the highest accuracy (93.83%), specificity (90.5%), receiver operating characteristic (ROC) area (0.984), Matthews correlation coefficient (MCC, 0.772) and kappa statistic (0.768) among the models, whereas LibSVM showed the highest sensitivity (94.4%). Additionally, toxicity predictive models based on the SingleCellcall DSSTox carcinogenicity database (AID1189) were developed, and random forest was again the best model. Deploying both RF predictive models on two unknown datasets identified 1317 of 1554 approved drugs and 2234 of 18,746 compounds in the ChEMBL anti-malarial dataset as non-toxic, anti-mycobacterial compounds. Thus, machine learning models offer an efficient way to find novel anti-mycobacterial hit compounds. We suggest that such machine learning techniques could be useful for screening drug candidates not only for tuberculosis but also for other diseases.
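
The evaluation statistics named above (accuracy, sensitivity, specificity, MCC and kappa) can all be derived from a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions, not the authors' own evaluation code; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification statistics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives. The name and interface are illustrative assumptions.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate

    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 0.0

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the calculation.
    for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

MCC and kappa are reported alongside accuracy because both correct for agreement expected by chance, which matters when the active and inactive classes are imbalanced.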