Investigation of Machine Intelligence in Compound Cell Activity Classification.

Machine intelligence has developed greatly over the past decades and is now widely used in many fields. In recent years, many reports have shown its satisfactory performance in drug discovery. In this study, machine intelligence methods were explored to assist cell activity prediction. Multiple methods, including support vector machine, decision tree, random forest, extra trees, gradient boosting machine, convolutional neural network, long short-term memory network, and gated recurrent unit network, were employed to classify compounds according to their cell activity. Unlike some previously reported classification models, compounds were represented as SMILES strings and used directly as input instead of chemical descriptors, in a manner analogous to natural language processing. Both the single cell strain dataset and the whole dataset were examined under balanced and imbalanced data distributions. Different activity cutoffs were set for the single cell strain dataset (Z-score = 3) and the whole dataset (Z-score = 5 and Z-score = 6). Nine metrics were used to evaluate the models: accuracy, precision, recall, F1-score, ROC AUC score, Cohen's kappa, Brier score, Matthews correlation coefficient, and balanced accuracy. The results show that the gradient boosting machine performs well on the balanced data distribution, whereas the convolutional neural network is well suited to the imbalanced one. The results demonstrate that both classic machine learning methods and deep learning methods have potential for the classification of compound cell activity.
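
As an illustration of the nine evaluation metrics named above, the following minimal sketch computes them for a binary active/inactive classification using only the Python standard library. It is not the code used in the study: the function names `roc_auc` and `binary_metrics`, the toy labels, and the convention of returning 0.0 when a denominator is zero are all assumptions made here for illustration.

```python
import math


def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties count as half a correct pair. Returns NaN if one class is absent.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def binary_metrics(y_true, y_pred, y_prob):
    """Compute the nine metrics for labels in {0, 1}.

    y_pred holds hard class predictions; y_prob holds predicted
    probabilities of the active class (label 1).
    """
    def div(a, b):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return a / b if b else 0.0

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    precision = div(tp, tp + fp)
    recall = div(tp, tp + fn)        # sensitivity, true positive rate
    specificity = div(tn, tn + fp)   # true negative rate
    f1 = div(2 * precision * recall, precision + recall)

    # Agreement expected by chance, used by Cohen's kappa.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = div(accuracy - p_e, 1 - p_e)

    mcc = div(tp * tn - fp * fn,
              math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))

    # Brier score: mean squared error of the predicted probabilities.
    brier = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / n

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc(y_true, y_prob),
        "cohen_kappa": kappa,
        "brier": brier,
        "mcc": mcc,
        "balanced_accuracy": (recall + specificity) / 2,
    }


if __name__ == "__main__":
    # Toy imbalanced case: 1 active among 10, model predicts all inactive.
    y_true = [1] + [0] * 9
    scores = binary_metrics(y_true, [0] * 10, [0.1] * 10)
    for name, value in scores.items():
        print(f"{name}: {value:.3f}")
```

The toy case shows why several metrics are reported together for imbalanced data: a classifier that labels every compound inactive reaches high accuracy, while balanced accuracy, Cohen's kappa, and the Matthews correlation coefficient stay at their chance-level values.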