Feature Selection Methods Based on Genetic Algorithms for in Silico Drug Design

Publisher Summary

High-throughput virtual screening is a way to screen a library of molecules for potential drug activity, and implementing such virtual bioactivity screening relies on the development of predictive quantitative structure-activity relationship (QSAR) models. This chapter addresses three approaches to feature selection for QSAR problems based on evolutionary algorithms (EA): common feature extraction with a genetic algorithm (GA) for a learning model, GA-scaled regression clustering, and GA-based feature selection from the correlation matrix. The chapter briefly explains the common GA-based method for feature selection in QSAR and expands on two novel approaches. It also demonstrates a hybrid method that combines GA-based feature selection with sensitivity analysis, and it describes a comparative benchmark for feature selection on an HIV-relevant QSAR model. Although the feature selection methods are all GA-based, the predictive models are a backpropagation-trained neural network and partial least squares.

The goal of QSAR is to predict the bioactivity of molecules from a set of descriptive features. The underlying assumption is that variations in biological activity can be correlated with measured or calculated molecular properties. Several types of descriptors are traditionally used in QSAR investigations, including 2D, electrotopological, 3D, and transferable atom equivalent (TAE) descriptors.
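As a rough illustration of how a GA can select descriptors using only a correlation matrix, the sketch below evolves bit masks over the descriptor columns. It is not the chapter's implementation: the fitness function (a correlation-based merit that rewards correlation with activity and penalises redundancy between selected descriptors), the tournament selection, one-point crossover, elitism, all parameter values, and the names `pearson`, `merit`, and `ga_select` are assumptions made for illustration. The data are synthetic.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def merit(mask, r_cf, r_ff):
    """Assumed fitness: high descriptor-activity correlation, low descriptor redundancy."""
    idx = [i for i, bit in enumerate(mask) if bit]
    k = len(idx)
    if k == 0:
        return 0.0
    mean_cf = sum(r_cf[i] for i in idx) / k
    if k == 1:
        mean_ff = 0.0
    else:
        pairs = [(i, j) for a, i in enumerate(idx) for j in idx[a + 1:]]
        mean_ff = sum(r_ff[i][j] for i, j in pairs) / len(pairs)
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)


def ga_select(columns, activity, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Evolve a 0/1 mask over descriptor columns; returns (best_mask, best_merit)."""
    rng = random.Random(seed)
    n = len(columns)  # needs at least two descriptors for one-point crossover
    # Correlations are computed once; the GA only ever looks at these values.
    r_cf = [abs(pearson(c, activity)) for c in columns]
    r_ff = [[abs(pearson(a, b)) for b in columns] for a in columns]

    def fit(mask):
        return merit(mask, r_cf, r_ff)

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=fit)
    for _ in range(generations):
        new_pop = [best[:]]  # elitism: the best mask always survives
        while len(new_pop) < pop_size:
            p1 = max(rng.sample(pop, 2), key=fit)  # binary tournament
            p2 = max(rng.sample(pop, 2), key=fit)
            cut = rng.randint(1, n - 1)            # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < p_mut else b for b in child]
            new_pop.append(child)
        pop = new_pop
        best = max(pop, key=fit)
    return best, fit(best)


# Synthetic data: one informative descriptor, a near-copy of it, and two noise columns.
data_rng = random.Random(1)
activity = [data_rng.gauss(0, 1) for _ in range(40)]
informative = [a + data_rng.gauss(0, 0.3) for a in activity]
redundant = [v + data_rng.gauss(0, 0.05) for v in informative]
noise_a = [data_rng.gauss(0, 1) for _ in range(40)]
noise_b = [data_rng.gauss(0, 1) for _ in range(40)]
columns = [informative, redundant, noise_a, noise_b]

mask, score = ga_select(columns, activity)
print("selected mask:", mask, "merit:", round(score, 3))
```

Because the fitness depends only on precomputed correlations, this variant acts as a filter and never trains a model. A wrapper-style variant would instead score each mask by fitting a predictive model, such as a neural network or partial least squares, on the selected descriptors and measuring validation error, at a much higher cost per evaluation.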
