An Evolutionary Approach for Feature Selection applied to ADMET Prediction

Feature selection methods seek the subset of features, or variables, in a data set that is most relevant for predicting a target value. In chemoinformatics, determining the most significant set of descriptors is of great importance because it helps improve ADMET (absorption, distribution, metabolism, excretion, and toxicity) prediction models. This paper presents an evolutionary approach to descriptor selection aimed at physicochemical property prediction. In particular, we propose a genetic algorithm whose fitness function, based on decision trees, evaluates the relevance of a set of descriptors. Other fitness functions, based on multivariate regression models, were also tested. The performance of the genetic algorithm as a feature selection technique was assessed for predicting logP (the octanol-water partition coefficient), using an ensemble of neural networks for the prediction task. The results showed that the evolutionary approach using decision trees is a promising technique for this bioinformatics application.
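To make the idea concrete, the following is a minimal sketch of a genetic algorithm that selects descriptors using a decision-tree-based fitness. It is not the configuration used in the paper: the synthetic data, the tiny pure-Python regression tree with fixed candidate thresholds, the size penalty, and the GA operators and parameters (tournament selection, one-point crossover, bit-flip mutation, elitism) are all assumptions chosen for illustration. The neural network ensemble used for the final logP prediction is not shown.

```python
import random

THRESHOLDS = (0.2, 0.4, 0.6, 0.8)  # assumed fixed split candidates (features lie in [0, 1])

def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def build_tree(X, y, depth):
    """Fit a small regression tree: a float leaf or (feature, threshold, left, right)."""
    if depth == 0 or len(y) < 4:
        return sum(y) / len(y)
    best = None
    for f in range(len(X[0])):
        for t in THRESHOLDS:
            left = [i for i, row in enumerate(X) if row[f] <= t]
            right = [i for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return sum(y) / len(y)
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left], depth - 1),
            build_tree([X[i] for i in right], [y[i] for i in right], depth - 1))

def predict(tree, row):
    """Walk the tree down to a leaf value."""
    while isinstance(tree, tuple):
        f, t, lo, hi = tree
        tree = lo if row[f] <= t else hi
    return tree

def fitness(mask, X_tr, y_tr, X_va, y_va, penalty=0.01):
    """Negative validation MSE of a tree trained on the masked descriptors,
    minus a small (assumed) penalty per selected descriptor."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("-inf")
    project = lambda X: [[row[j] for j in cols] for row in X]
    tree = build_tree(project(X_tr), y_tr, depth=3)
    mse = sum((predict(tree, r) - t) ** 2
              for r, t in zip(project(X_va), y_va)) / len(y_va)
    return -mse - penalty * len(cols)

def evolve(X_tr, y_tr, X_va, y_va, n_feat, pop_size=20, generations=15,
           p_mut=0.1, seed=0):
    """Evolve binary descriptor masks; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    cache = {}
    def fit(ind):
        key = tuple(ind)
        if key not in cache:
            cache[key] = fitness(ind, X_tr, y_tr, X_va, y_va)
        return cache[key]
    pop = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(pop_size)]
    for _ in range(generations):
        children = [max(pop, key=fit)[:]]              # elitism: keep the best mask
        while len(children) < pop_size:
            p1 = max(rng.sample(pop, 2), key=fit)      # binary tournament selection
            p2 = max(rng.sample(pop, 2), key=fit)
            cut = rng.randrange(1, n_feat)             # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < p_mut else b for b in child]  # bit-flip mutation
            children.append(child)
        pop = children
    best = max(pop, key=fit)
    return best, fit(best)

# Synthetic stand-in for a descriptor matrix: only descriptors 0 and 1 carry signal.
data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(8)] for _ in range(120)]
y = [3 * row[0] - 2 * row[1] + data_rng.gauss(0, 0.05) for row in X]
X_tr, y_tr, X_va, y_va = X[:80], y[:80], X[80:], y[80:]

best_mask, best_fit = evolve(X_tr, y_tr, X_va, y_va, n_feat=8)
print("selected descriptors:", [j for j, b in enumerate(best_mask) if b])
print("fitness:", round(best_fit, 3))
```

Each individual is a bit mask over the descriptors, so the search space is every subset of descriptors. Evaluating a mask on a held-out split rather than on the training data is one way to keep the fitness from rewarding overfitted subsets; swapping the tree for a multivariate regression model would only require changing `fitness`.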
