Prediction of novel mouse TLR9 agonists using a random forest approach

Toll-like receptor 9 (TLR9) is a key innate immune receptor involved in detecting infectious diseases and cancer. TLR9 activates the innate immune system following the recognition of single-stranded DNA oligonucleotides (ODNs) containing unmethylated cytosine-guanine (CpG) motifs. Because ODNs contain a considerable number of rotatable bonds, high-throughput in silico screening of CpG ODNs for potential TLR9 activity via traditional structure-based virtual screening approaches is challenging. In the current study, we present a machine learning-based method for predicting novel mouse TLR9 (mTLR9) agonists based on features including the count and position of motifs, the distance between motifs, and graphically derived features such as the radius of gyration and moment of inertia. We employed an in-house, experimentally validated dataset of 396 single-stranded synthetic ODNs to compare the results of five machine learning algorithms. Since the dataset was highly imbalanced, we used an ensemble learning approach based on repeated random down-sampling. Using the in-house experimental TLR9 activity data, we found that the random forest algorithm outperformed the other algorithms for TLR9 activity prediction on our dataset. Therefore, we developed a cross-validated ensemble classifier of 20 random forest models. The average Matthews correlation coefficient and balanced accuracy of our ensemble classifier on test samples were 0.61 and 80.0%, respectively, with a maximum balanced accuracy and Matthews correlation coefficient of 87.0% and 0.75, respectively. We confirmed that common sequence motifs including ‘CC’, ‘GG’, ‘AG’, ‘CCCG’ and ‘CGGC’ were overrepresented in mTLR9 agonists. Predictions on 6000 randomly generated ODNs were ranked, and the top 100 ODNs were synthesized and experimentally tested for activity in an mTLR9 reporter cell assay; 91 of the 100 selected ODNs showed high activity, confirming the accuracy of the model in predicting mTLR9 activity.
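For readers less familiar with the two reported metrics, the following is a minimal sketch of how the Matthews correlation coefficient and balanced accuracy are computed from binary predictions. The function names and the toy labels are illustrative assumptions and are not taken from the study.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = active, 0 = inactive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity, which is robust to class imbalance."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2

# Toy example (made-up labels, not study data): 4 actives, 6 inactives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
mcc = matthews_corrcoef(*counts)
bacc = balanced_accuracy(*counts)
```

Both metrics account for all four cells of the confusion matrix, which is why they are commonly preferred over plain accuracy when one class is much rarer than the other.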
We combined repeated random down-sampling with random forest to overcome the class imbalance problem and achieved promising results. Overall, we showed that the random forest algorithm outperformed the other machine learning algorithms tested, namely support vector machines, shrinkage discriminant analysis, gradient boosting machines and neural networks. Owing to its predictive performance and simplicity, the random forest technique is a useful method for predicting mTLR9 ODN agonists.
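The general idea of repeated random down-sampling with a voting ensemble can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the base learner here is a simple nearest-centroid stand-in so that the example stays dependency-free, whereas the study used random forest models; the function names, the majority-vote rule, and the toy data are all hypothetical.

```python
import random

def downsample_indices(labels, rng):
    """Keep every minority-class sample and an equal-size random draw from the majority class."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))

def fit_centroid(X, y):
    """Stand-in base learner (assumption): classify by the nearest class centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    centers = {c: centroid([x for x, t in zip(X, y) if t == c]) for c in (0, 1)}
    def predict(x):
        dist = {c: sum((a - b) ** 2 for a, b in zip(x, centers[c])) for c in centers}
        return min(dist, key=dist.get)
    return predict

def fit_ensemble(X, y, n_models=20, seed=0, fit=fit_centroid):
    """Train n_models base learners, each on its own balanced random subset."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = downsample_indices(y, rng)
        models.append(fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_ensemble(models, x):
    """Majority vote over the ensemble; ties are resolved as class 0."""
    votes = sum(model(x) for model in models)
    return 1 if 2 * votes > len(models) else 0

# Toy imbalanced data (made up): 3 actives near (5, 5), 12 inactives near the origin.
X = [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]] + [[0.1 * i, 0.0] for i in range(12)]
y = [1, 1, 1] + [0] * 12
ensemble = fit_ensemble(X, y, n_models=20, seed=0)
```

Because each model sees all minority samples but a different slice of the majority class, the ensemble uses most of the majority data overall while every individual model trains on a balanced set. A random forest learner could be substituted through the `fit` argument, provided it returns a callable that maps one feature vector to 0 or 1.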
