Predictive Power of Time-Series Based Machine Learning Models for DMPK Measurements in Drug Discovery

Four datasets of DMPK (drug metabolism and pharmacokinetics) parameters and one target-protein-specific dataset were analyzed with machine learning methods. The five compound sets comprised measurements of biological activity, plasma protein binding, permeability in MDCK I cell layers, intrinsic clearance by human liver microsomes, and plasma exposure in orally dosed rats. The measured data were sorted chronologically, reflecting the order in which they had been obtained in the discovery project. Early subsets of the chronologically sorted data were used as training sets to build models that predict the subsequent compounds, using k-nearest neighbors (kNN), partial least squares regression (PLSR), nonlinear PLSR, random forest regression, and support vector regression. A median model served as a baseline for assessing the prediction quality of the machine learning models. In order of increasing test set prediction error, the datasets ranked as follows: intrinsic clearance, plasma protein binding, cell layer permeability, biological activity on the target protein, and bioavailability as AUC in rats. Our results give a first estimate of the power of machine learning to predict DMPK properties of compounds in an ongoing drug discovery project.
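
The chronological evaluation described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the record layout (`date`, `x`, `y`), the function names, the choice of RMSE as the error measure, and the toy values are all assumptions made for illustration, and the median model is assumed here to predict the training set median for every test compound. A small kNN regressor stands in for the machine learning models.

```python
import math
from statistics import median

def time_split(records, n_train):
    """Sort records by measurement date; the earliest n_train form the training set."""
    ordered = sorted(records, key=lambda r: r["date"])
    return ordered[:n_train], ordered[n_train:]

def median_baseline(train):
    """Baseline (assumed form): always predict the median of the training values."""
    m = median(r["y"] for r in train)
    return lambda x: m

def knn_regressor(train, k=2):
    """Predict the mean value of the k nearest training compounds (Euclidean distance)."""
    def predict(x):
        nearest = sorted(train, key=lambda r: math.dist(r["x"], x))[:k]
        return sum(r["y"] for r in nearest) / len(nearest)
    return predict

def rmse(predict, test):
    """Root mean squared error of a predictor on the later (test) compounds."""
    return math.sqrt(sum((predict(r["x"]) - r["y"]) ** 2 for r in test) / len(test))

# Toy data, invented for illustration: "date" is a measurement order index,
# "x" a one-dimensional descriptor, "y" the measured property.
records = [
    {"date": 3, "x": (0.20,), "y": 1.2},
    {"date": 1, "x": (0.10,), "y": 1.0},
    {"date": 4, "x": (0.90,), "y": 3.1},
    {"date": 2, "x": (0.80,), "y": 3.0},
    {"date": 6, "x": (0.85,), "y": 3.0},
    {"date": 5, "x": (0.15,), "y": 1.1},
]

train, test = time_split(records, n_train=4)
baseline_error = rmse(median_baseline(train), test)
knn_error = rmse(knn_regressor(train, k=2), test)
print(f"median baseline RMSE: {baseline_error:.3f}")
print(f"kNN RMSE:             {knn_error:.3f}")
```

Unlike a random split, this split never lets the model see compounds measured after those it is asked to predict, which mirrors the situation in an ongoing project. Comparing each model's error with the baseline error on the same test compounds indicates whether the model adds predictive value.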