Prediction of genotoxicity of chemical compounds by statistical learning methods.

Various toxicological profiles, such as genotoxic potential, need to be studied during drug discovery and submitted to drug regulatory authorities for safety evaluation. As part of the effort to develop low-cost, efficient tools for testing adverse drug reactions, several statistical learning methods have been used to build genotoxicity prediction systems, with accuracies of up to 73.8% for genotoxic (GT+) and 92.8% for nongenotoxic (GT-) agents. These systems were developed and tested using fewer than 400 known GT+ and GT- agents, a set substantially smaller and less diverse than the 860 GT+ and GT- agents known at present. It therefore needs to be examined whether a similar level of accuracy can be achieved for this more diverse set of molecules, and other statistical learning methods not yet applied to genotoxicity prediction need to be evaluated. This work tests several statistical learning methods on 860 GT+ and GT- agents: support vector machines (SVM), probabilistic neural network (PNN), k-nearest neighbor (k-NN), and C4.5 decision tree (DT). A feature selection method, recursive feature elimination, is used to select molecular descriptors relevant to genotoxicity. The overall accuracies of SVM, k-NN, and PNN are comparable to the results from earlier studies, whereas those of DT are lower; SVM gives the highest accuracies, 77.8% for GT+ and 92.7% for GT- agents. Our study suggests that statistical learning methods, particularly SVM, k-NN, and PNN, are useful for predicting the genotoxic potential of a diverse set of molecules.
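The separate accuracies quoted for GT+ and GT- agents are per-class accuracies: the fraction of genotoxic agents predicted as genotoxic, and the fraction of nongenotoxic agents predicted as nongenotoxic. The following minimal Python sketch shows how such figures can be computed from true and predicted labels. The function name `per_class_accuracy`, the 1/0 label encoding, and the toy labels are illustrative assumptions, not the data or code used in the study.

```python
def per_class_accuracy(y_true, y_pred):
    """Return (GT+ accuracy, GT- accuracy, overall accuracy) as fractions.

    Labels are assumed to be 1 for genotoxic (GT+) and 0 for
    nongenotoxic (GT-); this encoding is an illustrative choice.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # GT+ predicted GT+
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # GT+ predicted GT-
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # GT- predicted GT-
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # GT- predicted GT+
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    gt_pos_acc = tp / (tp + fn)   # accuracy on GT+ agents (sensitivity)
    gt_neg_acc = tn / (tn + fp)   # accuracy on GT- agents (specificity)
    overall = (tp + tn) / len(pairs)
    return gt_pos_acc, gt_neg_acc, overall


# Hypothetical toy labels, not results from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
gt_pos_acc, gt_neg_acc, overall = per_class_accuracy(y_true, y_pred)
print(f"GT+ accuracy: {gt_pos_acc:.3f}")
print(f"GT- accuracy: {gt_neg_acc:.3f}")
print(f"Overall accuracy: {overall:.3f}")
```

Reporting the two classes separately matters when one class is much larger than the other, since overall accuracy alone can conceal poor performance on the smaller class.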
