In silico toxicity prediction by support vector machine and SMILES representation-based string kernel

There is a great need to assess the harmful effects, or toxicities, of chemicals to which humans are exposed. In the present paper, a string kernel based on the simplified molecular input line entry specification (SMILES) representation, together with the support vector machine (SVM) algorithm, was used to classify the toxicity of chemicals from the US Environmental Protection Agency Distributed Structure-Searchable Toxicity (DSSTox) database network. In this method, the molecular structure is encoded directly as a series of SMILES substrings that represent the presence of particular chemical elements and of different kinds of chemical bonds (double and triple bonds, and stereochemistry) in the molecule. The SMILES string kernel can therefore measure the similarity of two molecules directly from the local structural information contained in their SMILES strings. Two model validation approaches, five-fold cross-validation and an independent validation set, were used to assess the predictive capability of the developed models. The results indicate that an SVM based on the SMILES string kernel is a promising alternative modelling approach for predicting the potential toxicity of chemicals.
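
The idea of comparing molecules through shared SMILES substrings can be illustrated with a minimal sketch. The abstract does not give the exact kernel definition, so the code below assumes a simple spectrum-style kernel: it counts every contiguous substring up to a chosen length and takes the inner product of the two count vectors. The function names, the `max_len` parameter, and the character-level treatment of SMILES (which splits two-letter atoms such as `Cl`) are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from math import sqrt


def substring_counts(smiles, max_len=3):
    """Count every contiguous substring of length 1..max_len (assumed feature map)."""
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(smiles) - k + 1):
            counts[smiles[i:i + k]] += 1
    return counts


def string_kernel(a, b, max_len=3):
    """Inner product of the substring count vectors of two SMILES strings."""
    counts_a = substring_counts(a, max_len)
    counts_b = substring_counts(b, max_len)
    # Counter returns 0 for substrings that are absent from b.
    return sum(n * counts_b[s] for s, n in counts_a.items())


def normalized_kernel(a, b, max_len=3):
    """Scale the kernel so that every non-empty molecule has self-similarity 1."""
    k_ab = string_kernel(a, b, max_len)
    return k_ab / sqrt(string_kernel(a, a, max_len) * string_kernel(b, b, max_len))


def gram_matrix(smiles_list, max_len=3):
    """Pairwise kernel matrix for a list of non-empty SMILES strings."""
    return [[normalized_kernel(a, b, max_len) for b in smiles_list]
            for a in smiles_list]


# Ethanol, acetaldehyde and hydrogen cyanide as toy inputs.
molecules = ["CCO", "CC=O", "C#N"]
K = gram_matrix(molecules)
for name, row in zip(molecules, K):
    print(name, [round(value, 3) for value in row])
```

A matrix like `K` is the kind of input an SVM with a precomputed kernel would take (for example, scikit-learn's `SVC(kernel="precomputed")`). Five-fold cross-validation would then slice the rows and columns of `K` by the training and test indices of each fold.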
