Prediction of chemical carcinogenicity by machine learning approaches

In this paper we report a successful application of machine learning approaches to the prediction of chemical carcinogenicity. Two approaches, a support vector machine (SVM) and an artificial neural network (ANN), were evaluated for predicting chemical carcinogenicity from molecular structure descriptors. A diverse set of 844 compounds, comprising 600 carcinogenic (CG+) and 244 noncarcinogenic (CG−) molecules, was used to estimate the accuracy of these approaches. The database was divided into two sets: a model construction set and an independent test set. Relevant molecular descriptors were selected from a wide set of descriptors, including physicochemical properties and constitutional, topological, and geometrical descriptors, by a hybrid feature selection method combining Fisher's score and Monte Carlo simulated annealing. The first model validation method was five-fold cross-validation, in which the model construction set is split into five subsets. It was used to select descriptors and to optimise the model parameters by maximising the averaged overall accuracy. The final SVM model gave an averaged prediction accuracy of 90.7% for CG+ compounds, 81.6% for CG− compounds and 88.1% overall, while the corresponding ANN model gave an averaged prediction accuracy of 86.1% for CG+ compounds, 79.3% for CG− compounds and 84.2% overall. These results indicate that the hybrid feature selection method is very efficient and that the selected descriptors are relevant to the carcinogenicity of compounds. The second model validation method was a hold-out method: using the selected descriptors and the optimised model parameters, the classification model was built on the whole model construction set, and its predictive ability was tested on the independent test set. In this test, the SVM model gave a prediction accuracy of 87.6% for CG+ compounds, 79.1% for CG− compounds and 85.0% overall, and the ANN model gave a prediction accuracy of 85.6% for CG+ compounds, 79.1% for CG− compounds and 83.6% overall. The results indicate that the built models are potentially useful for predicting the carcinogenicity of untested compounds.
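As an illustration of two quantities used above, the following minimal sketch computes a Fisher score for a single descriptor and the per-class and overall accuracies of a set of predictions. It is not the implementation used in this work. The Fisher score is taken here in one common two-class form (squared difference of the class means divided by the sum of the class sample variances), which is an assumption, and the function names `fisher_score` and `class_accuracies`, the label coding (1 for CG+, 0 for CG−) and the toy values are hypothetical.

```python
from statistics import mean, variance


def fisher_score(values, labels):
    """Two-class Fisher score of one descriptor (assumed form):
    (mean_pos - mean_neg)^2 / (var_pos + var_neg), using sample variances.
    Each class needs at least two compounds."""
    pos = [v for v, y in zip(values, labels) if y == 1]  # CG+ compounds
    neg = [v for v, y in zip(values, labels) if y == 0]  # CG- compounds
    diff_sq = (mean(pos) - mean(neg)) ** 2
    denom = variance(pos) + variance(neg)
    if denom == 0:
        # No within-class spread: perfectly separating if the means differ,
        # otherwise the descriptor is constant and carries no information.
        return float("inf") if diff_sq > 0 else 0.0
    return diff_sq / denom


def class_accuracies(y_true, y_pred):
    """Return (CG+ accuracy, CG- accuracy, overall accuracy) as fractions."""
    pairs = list(zip(y_true, y_pred))
    n_pos = sum(1 for t, _ in pairs if t == 1)
    n_neg = len(pairs) - n_pos
    correct_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    correct_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    return (correct_pos / n_pos,
            correct_neg / n_neg,
            (correct_pos + correct_neg) / len(pairs))


# Toy descriptor values for three CG- and three CG+ compounds (made up).
descriptor = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
print(fisher_score(descriptor, labels))  # 18.0

# Toy predictions for four CG+ and two CG- compounds (made up).
print(class_accuracies([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 1]))
```

In such a scheme, descriptors with higher scores would be ranked as more discriminating before any further search over descriptor subsets. The overall accuracy is the class-size-weighted average of the two per-class accuracies, which is why it lies closer to the CG+ value when CG+ compounds are in the majority.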
