Optimal Sparse Descriptor Selection for QSAR Using Bayesian Methods

Choosing the set of molecular descriptors (features) most relevant to a given biological response variable is a central problem in QSAR that has not been solved in an optimal, robust way. It belongs to an important class of mathematical problems in which the number of variables greatly exceeds the number of observations (grossly underdetermined systems). We used two Bayesian approaches to carry out this task on a suite of QSAR data sets. First, we employed a specialized sparse Bayesian feature reduction method, based on an EM algorithm with a Laplacian prior, to select a small set of the most relevant descriptors for modeling the response variables from a much larger pool of possibilities. Having chosen the optimum descriptors in a supervised manner, we then used a Bayesian regularized neural network to carry out nonlinear regression and derive robust, parsimonious QSAR models for five drug data sets. Models were validated using independent test sets, and the results were compared with those of other contemporary descriptor selection methods. Issues around validating small QSAR data sets are also discussed in detail. The sparse feature selection algorithm proved to be an excellent, robust method for selecting descriptors for QSAR models, as it is supervised (descriptors are chosen in a context-dependent manner), parsimonious (models are not overly complex), and inherently interpretable. Coupled with a robust, parsimonious nonlinear modeling method such as the Bayesian regularized neural network, it provides a means of optimally modeling the data and allows the model to be interpreted in terms of the most relevant descriptors.
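The abstract does not spell out the EM update, so the following is only a minimal sketch of one common way to do sparse selection with a Laplacian prior, not the authors' implementation. It assumes a linear model and writes the Laplacian prior as a Gaussian scale mixture. The E-step then reweights each coefficient by 1/|w_i|, and the M-step is a reweighted ridge solve. The function names, the single penalty `lam` (noise variance folded into the prior scale), the selection `threshold`, and the toy data are all illustrative assumptions.

```python
import math

def solve_linear(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = s / aug[i][i]
    return z

def sparse_em_laplace(X, y, lam=1.0, max_iter=500, tol=1e-10):
    """EM-style estimate of sparse linear weights under a Laplacian prior.

    X: list of rows (compounds x descriptors), y: list of responses.
    lam: assumed penalty strength (noise variance times prior scale).
    """
    n, p = len(X), len(X[0])
    # Precompute X^T X and X^T y once.
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         for i in range(p)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    w = [1.0] * p  # first pass is an ordinary ridge solve
    for _ in range(max_iter):
        # E-step: expected inverse prior variance is lam / |w_i|;
        # keep u_i = sqrt(|w_i| / lam) so zero weights cause no division.
        u = [math.sqrt(abs(wi) / lam) for wi in w]
        # M-step in the stable form w = U (I + U A U)^{-1} U b.
        M = [[u[i] * A[i][j] * u[j] + (1.0 if i == j else 0.0)
              for j in range(p)] for i in range(p)]
        z = solve_linear(M, [u[i] * b[i] for i in range(p)])
        w_new = [u[i] * z[i] for i in range(p)]
        change = max(abs(a - c) for a, c in zip(w_new, w))
        w = w_new
        if change < tol:
            break
    return w

def select_descriptors(w, threshold=1e-6):
    """Indices of descriptors whose weight was not driven to (near) zero."""
    return [i for i, wi in enumerate(w) if abs(wi) > threshold]

# Toy data (made up, not from the paper): 3 compounds, 5 descriptors,
# i.e. more variables than observations.
X_toy = [[1.0, 0.2, -0.5, 0.3, 0.0],
         [0.5, -0.1, 0.4, 0.9, 0.1],
         [-1.0, 0.3, 0.1, -0.2, 0.2]]
y_toy = [2.0, 1.1, -2.1]
w_toy = sparse_em_laplace(X_toy, y_toy, lam=0.5)
print(select_descriptors(w_toy))
```

In a workflow like the one the abstract describes, the indices returned by `select_descriptors` would define the reduced descriptor set passed to the nonlinear model. Larger values of `lam` prune more aggressively.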