CarcinoPred-EL: Novel models for predicting the carcinogenicity of chemicals using molecular fingerprints and ensemble learning methods

Carcinogenicity is a highly toxic endpoint of certain chemicals and has become an important concern in the drug development process. In this study, three novel ensemble classification models, namely Ensemble SVM, Ensemble RF, and Ensemble XGBoost, were developed to predict the carcinogenicity of chemicals. The models combine seven types of molecular fingerprints with three machine learning methods and were built on a dataset of 1003 diverse compounds with rat carcinogenicity data. Among the three models, Ensemble XGBoost performed best, giving an average accuracy of 70.1 ± 2.9%, sensitivity of 67.0 ± 5.0%, and specificity of 73.1 ± 4.4% in five-fold cross-validation, and an accuracy of 70.0%, sensitivity of 65.2%, and specificity of 76.5% in external validation. Compared with some recent methods, the ensemble models outperform some machine learning-based approaches; relative to rule-based expert systems, they yield equal accuracy and higher specificity but lower sensitivity. The results also indicate that the ensemble models could be further improved if more data were available. As an application, the ensemble models were used to identify potential carcinogens in the DrugBank database. Overall, the results indicate that the proposed models are helpful in predicting the carcinogenicity of chemicals. A web server, CarcinoPred-EL, has been built for these models (http://ccsipb.lnu.edu.cn/toxicity/CarcinoPred-EL/).
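The following is a minimal sketch, not the authors' implementation, of how an ensemble over fingerprint-specific base classifiers can be scored with the three metrics reported above. It assumes that each base classifier outputs a probability of carcinogenicity per compound and that the ensemble averages these probabilities and thresholds at 0.5; the abstract does not state the combination rule, so this is an assumption. The fingerprint names (`fp_a`, `fp_b`, `fp_c`), the function names, and all numbers are hypothetical toy values.

```python
from statistics import mean


def ensemble_predict(prob_by_fingerprint, threshold=0.5):
    """Average the per-fingerprint probabilities for each compound,
    then label a compound 1 (carcinogen) if the average reaches the threshold.
    Averaging is an assumed combination rule, not taken from the paper."""
    columns = list(prob_by_fingerprint.values())
    n_compounds = len(columns[0])
    averaged = [mean(col[i] for col in columns) for i in range(n_compounds)]
    return [1 if p >= threshold else 0 for p in averaged]


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate), and specificity
    (true negative rate) from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical toy data: six compounds, three fingerprint-specific classifiers.
y_true = [1, 1, 1, 0, 0, 0]
probabilities = {
    "fp_a": [0.9, 0.6, 0.3, 0.2, 0.6, 0.1],
    "fp_b": [0.8, 0.7, 0.4, 0.3, 0.5, 0.2],
    "fp_c": [0.7, 0.5, 0.2, 0.4, 0.7, 0.3],
}

y_pred = ensemble_predict(probabilities)
metrics = classification_metrics(y_true, y_pred)
print(y_pred)
print(metrics)
```

In a five-fold cross-validation setting, these metrics would be computed once per held-out fold and then summarized as a mean and standard deviation, which is the form in which the abstract reports them.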
