Coping with Unbalanced Class Data Sets in Oral Absorption Models

Class imbalance is common in drug discovery data sets. Published oral absorption data sets contain considerably more highly absorbed compounds than poorly absorbed ones. Models trained on such data are biased toward highly absorbed compounds and generalize poorly to industry settings, where more early-stage drug candidates are poorly absorbed. This paper presents two strategies for coping with unbalanced class data sets: undersampling the majority (highly absorbed) class and applying misclassification costs in classification decision trees. The data set published by Hou et al. [J. Chem. Inf. Model. 2007, 47, 208-218], which contains percentage human intestinal absorption values for 645 drugs and drug-like compounds, was used to develop and validate classification trees by classification and regression tree (C&RT) analysis. The results indicate that undersampling the highly absorbed majority class to give a balanced (50:50) training set yields better accuracy for poorly absorbed compounds, whereas the biased training set yields higher accuracy for highly absorbed compounds. Applying misclassification costs improved class predictions when the costs were set to reduce either false positives or false negatives. Moreover, the classical overall accuracy measure used in many publications was shown to be particularly misleading for unbalanced data sets; the more appropriate measures presented here may be used for a more realistic assessment of classification model performance. These strategies therefore offer improvements for coping with unbalanced class data sets and for obtaining classification models applicable in industry.
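As a rough illustration of two ideas in the abstract, the sketch below shows random undersampling of the majority class to a 50:50 ratio and a comparison of overall accuracy with class-wise measures. It is a minimal sketch, not the authors' C&RT workflow: the function names, the 90:10 toy class split, and the choice of sensitivity, specificity, balanced accuracy, and the Matthews correlation coefficient are all assumptions made for illustration, since the abstract does not name the measures it recommends.

```python
import math
import random


def undersample(labels, majority, seed=0):
    """Return sorted indices of a balanced subset (hypothetical helper).

    Keeps every minority-class item and draws an equal number of
    majority-class items at random, without replacement.
    """
    rng = random.Random(seed)
    major = [i for i, y in enumerate(labels) if y == majority]
    minor = [i for i, y in enumerate(labels) if y != majority]
    keep = rng.sample(major, min(len(minor), len(major)))
    return sorted(minor + keep)


def class_metrics(tp, fn, fp, tn):
    """Compute overall and class-wise measures from a 2x2 confusion matrix.

    Here "positive" is assumed to mean the highly absorbed class.
    """
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)  # accuracy on the positive class
    specificity = tn / (tn + fp)  # accuracy on the negative class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "mcc": mcc,
    }


# Toy data (assumed): 90 highly absorbed and 10 poorly absorbed compounds.
labels = ["high"] * 90 + ["poor"] * 10
balanced_idx = undersample(labels, majority="high")

# A trivial model that predicts "high" for every compound in the toy set.
trivial = class_metrics(tp=90, fn=0, fp=10, tn=0)
print(len(balanced_idx), trivial)
```

On this toy set the trivial model reaches an overall accuracy of 0.9 while its specificity and Matthews correlation coefficient are both 0, which is the kind of discrepancy that makes overall accuracy misleading for unbalanced data.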
