Pre-processing Feature Selection for Improved C&RT Models for Oral Absorption

Thousands of molecular descriptors can currently be calculated to represent a chemical compound. Using all of them in quantitative structure-activity relationship (QSAR) modeling can result in overfitting and decreased interpretability, and thus reduced model performance. Feature selection methods can overcome some of these problems by drastically reducing the number of molecular descriptors and retaining those relevant to the property being predicted. Decision trees such as classification and regression trees (C&RT) have an embedded feature selection algorithm, but it can be inadequate: further down the tree fewer compounds are available for descriptor selection, so descriptors may be selected that are not optimal. In this work we compare two broad approaches to feature selection: (1) a "two-stage" procedure, in which a pre-processing feature selection method selects a subset of descriptors and C&RT then selects descriptors from this subset to build a decision tree; and (2) a "one-stage" approach, in which C&RT is the only feature selection technique. These methods were applied to improve the prediction accuracy of QSAR models for oral absorption. Additionally, this work uses misclassification costs in model building to overcome the bias of oral absorption data sets, which contain more highly absorbed than poorly absorbed compounds. In most cases the two-stage approach with pre-processing feature selection gave higher model accuracy than the one-stage approach. Using the top 20 molecular descriptors from the random forest predictor importance method gave the most accurate C&RT classification model. The molecular descriptors selected by the five filter feature selection methods are compared in relation to oral absorption.
In conclusion, the use of filter pre-processing feature selection methods and misclassification costs produces models with better interpretability and predictive performance for oral absorption.
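The two approaches compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, uses synthetic placeholder data instead of real molecular descriptors, takes the random forest's impurity-based importances as a stand-in for the "predictor importance" ranking, and approximates misclassification costs with `class_weight`. The helper name `two_stage_tree`, the cost values, and the tree settings are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def two_stage_tree(X_train, y_train, n_keep=20, class_weight=None, seed=0):
    """Hypothetical helper: filter descriptors by random forest importance,
    then grow a single CART-style tree on the retained subset."""
    # Stage 1: rank descriptors by impurity-based random forest importance.
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X_train, y_train)
    top = np.argsort(forest.feature_importances_)[::-1][:n_keep]

    # Stage 2: the tree performs its own embedded selection within the subset.
    tree = DecisionTreeClassifier(
        min_samples_leaf=5, class_weight=class_weight, random_state=seed
    )
    tree.fit(X_train[:, top], y_train)
    return top, tree


# Synthetic placeholder data: 200 "descriptors", class 1 ("highly absorbed")
# deliberately over-represented to mimic the imbalance described above.
X, y = make_classification(
    n_samples=400, n_features=200, n_informative=10,
    weights=[0.2, 0.8], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Assumed cost ratio: errors on the minority class weigh four times as much.
costs = {0: 4.0, 1: 1.0}

selected, two_stage = two_stage_tree(X_train, y_train, class_weight=costs)

# One-stage baseline: the tree sees every descriptor.
one_stage = DecisionTreeClassifier(
    min_samples_leaf=5, class_weight=costs, random_state=0
).fit(X_train, y_train)

print("two-stage accuracy:", two_stage.score(X_test[:, selected], y_test))
print("one-stage accuracy:", one_stage.score(X_test, y_test))
```

On synthetic data the relative accuracies carry no meaning for oral absorption; the sketch only shows where the pre-processing step and the cost weighting enter the workflow.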
