Multi-output Perturbation-Theory Machine Learning (PTML) Model of ChEMBL Data for Antiretroviral Compounds.

Retroviral infections such as HIV remain, to date, diseases with no cure. Determining the target proteins of new antiretroviral compounds is a major goal for medicine and pharmaceutical chemistry. ChEMBL holds large, complex datasets that are hard to organize, and the large number of characteristics described for each assay makes the information difficult to analyze when predicting new drug candidates for retroviral infections. For this reason, we propose a new predictive model that combines Perturbation Theory (PT) principles with Machine Learning (ML) modelling, creating a tool that can take advantage of all the available information. The PTML model proposed in this work for the ChEMBL dataset of preclinical experimental assays for antiretroviral compounds consists of a linear equation with four variables. The PT operators used are based on multi-condition moving averages, which combine different features and simplify the management of all the data. More than 140,000 preclinical assays for 56,105 compounds with different characteristics or experimental conditions have been carried out and can be found in the ChEMBL database, covering combinations of 359 biological activity parameters (c0), 55 different protein accessions (c1), 83 cell lines (c2), 64 organisms of assay (c3), and 773 subtypes or strains. We have included 150,148 preclinical experimental assays for HIV, 1,188 for HTLV, 84 for Simian immunodeficiency virus, 370 for Murine Leukemia virus, 119 for Rous sarcoma virus, 1,581 for MMTV, etc. We also included 5,277 assays for Hepatitis B virus. The developed PTML model reached a sensitivity of 73.05% for training and 73.10% for validation, a specificity of 86.61% for training and 87.17% for validation, and an accuracy of 75.84% for training and 75.98% for validation. We also compared alternative PTML models with different PT operators such as covariance, moments and exponential terms.
Finally, we compared the PTML model with other ML models from the literature and with nonlinear artificial neural network (ANN) models. We conclude that this PTML model is the first to consider multiple characteristics of preclinical experimental antiretroviral assays in combination, yielding a simple, useful and adaptable tool that could reduce time and costs in antiretroviral drug research.
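
As an illustration only, the idea behind a multi-condition moving-average PT operator can be sketched as follows: for each assay, a descriptor value is compared with the average of that descriptor over all assays sharing the same condition (c0, c1, ...), and the resulting deviations become inputs to a linear model. The sketch also shows how sensitivity, specificity and accuracy are obtained from binary predictions. The descriptor name `logp`, the condition labels and all numbers are invented toy values, not ChEMBL data, and the function names are hypothetical; the abstract does not specify which descriptors or which four variables the actual model uses.

```python
from collections import defaultdict


def moving_average_deviations(rows, descriptor, conditions):
    """Return, per row, descriptor minus its mean over rows sharing each condition.

    `rows` is a list of dicts; `descriptor` and `conditions` are keys in them.
    This is an assumed, simplified form of a moving-average PT operator.
    """
    sums = {c: defaultdict(float) for c in conditions}
    counts = {c: defaultdict(int) for c in conditions}
    for row in rows:
        for c in conditions:
            sums[c][row[c]] += row[descriptor]
            counts[c][row[c]] += 1

    deviations = []
    for row in rows:
        devs = {}
        for c in conditions:
            mean = sums[c][row[c]] / counts[c][row[c]]
            devs[c] = row[descriptor] - mean
        deviations.append(devs)
    return deviations


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = active).

    Assumes both classes are present in y_true (no zero-division guard).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy


# Toy assays (invented values): one descriptor and two conditions.
toy_rows = [
    {"logp": 1.0, "c0": "IC50", "c1": "P1"},
    {"logp": 3.0, "c0": "IC50", "c1": "P2"},
    {"logp": 5.0, "c0": "EC50", "c1": "P1"},
]
toy_devs = moving_average_deviations(toy_rows, "logp", ["c0", "c1"])

# Toy labels and predictions (invented values).
toy_metrics = classification_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```

In this sketch each deviation measures how far a compound's descriptor lies from the typical value observed under the same assay condition, which is one way a single linear equation could account for many experimental conditions at once.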