iDTi-CSsmoteB: Identification of Drug–Target Interaction Based on Drug Chemical Structure and Protein Sequence Using XGBoost With Over-Sampling Technique SMOTE

Identifying interactions between drugs and proteins is a crucial challenge in drug discovery, and can lead researchers to develop novel drug compounds or to find new target proteins for existing drugs. Determining drug–target interactions (DTIs) through wet-lab experiments is extremely time-consuming, costly, and tedious. To date, multiple computational techniques have been presented to simplify the drug discovery process, but a huge number of interactions remain undiscovered. Furthermore, class imbalance in the datasets can significantly degrade classification accuracy and has not yet been addressed effectively. In this paper, we propose a novel high-throughput computational model, called iDTi-CSsmoteB, for the identification of DTIs based on drug chemical structures and protein sequences. More specifically, features are extracted from the protein sequence with position-specific scoring matrix (PSSM)-Bigram, amphiphilic pseudo amino acid composition (AM-PseAAC), and dipeptide PseAAC descriptors, which represent evolutionary and sequence information. The drug chemical structure is represented as a molecular substructure fingerprint (MSF), which describes the presence of functional fragments or groups. Finally, we use the over-sampling technique SMOTE to overcome the imbalance in the datasets and apply the XGBoost algorithm as a classifier to predict DTIs. To evaluate the performance of iDTi-CSsmoteB, several experiments were conducted on four benchmark datasets, namely enzyme, ion channel, GPCR, and nuclear receptor, using fivefold cross-validation. The experimental analysis shows that our model outperforms similar methods in terms of area under the ROC curve (auROC). In addition, our results indicate the effectiveness of the feature extraction techniques, balancing methods, and classifier for predicting DTIs, which can provide a basis for new drug development.
The iDTi-CSsmoteB web server is available online at http://idticssmoteb-uestc.me/.
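
The balancing step named in the abstract can be illustrated with a minimal sketch of SMOTE-style over-sampling. This is not the authors' implementation: it is a simplified variant that picks base samples at random instead of iterating over every minority sample, and the function names (`smote`, `balance`), the toy feature vectors, and the parameter defaults are assumptions made for illustration. The feature extraction and the XGBoost classifier are not shown.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Over-sample the minority class until both classes have equal size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    n_needed = max(0, n_majority - len(minority))
    new_points = smote(minority, n_needed, k=k, seed=seed)
    return X + new_points, y + [minority_label] * len(new_points)


# Hypothetical toy data: 7 non-interacting pairs (label 0), 3 interacting pairs (label 1).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4],
     [0.4, 0.1], [0.2, 0.2], [1.0, 1.0], [1.2, 0.8], [0.9, 1.1]]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

X_bal, y_bal = balance(X, y)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

Each synthetic point lies on the line segment between two existing minority samples, so the minority class grows without simply duplicating rows. In a cross-validated evaluation, a sketch like this would normally be applied to the training folds only, so that synthetic points never leak into the held-out fold; whether the original study did so is not stated in the abstract.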
