A Wrapper-Based Feature Selection Method for ADMET Prediction Using Evolutionary Computing

Wrapper methods search for a subset of the features or variables in a data set such that the selected features are the most relevant for predicting a target value. In the chemoinformatics context, determining the most significant set of descriptors is of great importance because it contributes to improving ADMET prediction models. In this paper, a comprehensive analysis of descriptor selection aimed at physicochemical property prediction is presented. In addition, we propose an evolutionary approach in which different fitness functions are compared. The comparison establishes which method selects the subset of descriptors that best predicts a given property while keeping the cardinality of the subset to a minimum. The performance of the proposal was assessed for predicting hydrophobicity, using an ensemble of neural networks for the prediction task. The results showed that the evolutionary approach using a nonlinear fitness function constitutes a novel and promising technique for this bioinformatics application.
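
To make the wrapper idea concrete, the following is a minimal sketch of an evolutionary descriptor selection loop, not the method evaluated in the paper. Several choices are assumptions made for illustration: the predictor is a leave-one-out 1-nearest-neighbour regressor standing in for the neural network ensemble, the fitness is prediction error plus a linear size penalty weighted by a hypothetical `alpha` (it is not one of the fitness functions compared in the paper), and all function names and parameter values are invented here.

```python
import random


def loo_1nn_error(X, y, mask):
    """Leave-one-out mean squared error of a 1-nearest-neighbour regressor
    that only sees the descriptors switched on in `mask`.
    (Assumed stand-in for the paper's neural network ensemble.)"""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("inf")  # an empty subset cannot predict anything
    total = 0.0
    for i, xi in enumerate(X):
        best_d, best_y = float("inf"), 0.0
        for k, xk in enumerate(X):
            if k == i:
                continue
            d = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if d < best_d:
                best_d, best_y = d, y[k]
        total += (y[i] - best_y) ** 2
    return total / len(X)


def fitness(X, y, mask, alpha=0.05):
    """Lower is better: prediction error plus a penalty that grows with the
    fraction of selected descriptors (hypothetical form, `alpha` assumed)."""
    return loo_1nn_error(X, y, mask) + alpha * sum(mask) / len(mask)


def select_features(X, y, pop_size=20, generations=30, p_mut=0.1,
                    alpha=0.05, seed=0):
    """Simple genetic algorithm over bit masks; needs at least 2 descriptors."""
    rng = random.Random(seed)
    n = len(X[0])
    cache = {}

    def score(mask):
        key = tuple(mask)
        if key not in cache:  # each subset is evaluated only once
            cache[key] = fitness(X, y, mask, alpha)
        return cache[key]

    # Start from the full descriptor set plus random subsets.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    best = min(pop, key=score)
    for _ in range(generations):
        nxt = [best[:]]  # elitism: the best subset always survives
        while len(nxt) < pop_size:
            # Binary tournament selection of two parents.
            p1 = min(rng.sample(pop, 2), key=score)
            p2 = min(rng.sample(pop, 2), key=score)
            cut = rng.randrange(1, n)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - b if rng.random() < p_mut else b for b in child]
            nxt.append(child)
        pop = nxt
        best = min(pop, key=score)
    return best, score(best)
```

Calling `select_features(X, y)` on a list of descriptor rows `X` and a list of target values `y` returns the best bit mask found and its fitness. Because the full descriptor set is placed in the initial population and the best individual is always carried over, the returned subset never scores worse than using all descriptors under this fitness.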
