TNFPred: identifying tumor necrosis factors using hybrid features based on word embeddings

Background Cytokines are a class of small proteins that act as chemical messengers and play a significant role in essential cellular processes, including immune regulation, hematopoiesis, and inflammation. As one important family of cytokines, tumor necrosis factors are associated with the regulation of various biological processes such as cell proliferation and differentiation, apoptosis, lipid metabolism, and coagulation. These cytokines are also implicated in various diseases, such as insulin resistance, autoimmune diseases, and cancer. Because tumor necrosis factors and other cytokines are interdependent, distinguishing tumor necrosis factors from other cytokines is a challenge for biological scientists. In this research, we employed a word embedding technique to create hybrid features, which proved effective at identifying tumor necrosis factors among cytokine sequences. We segmented each protein sequence into protein words and created a word embedding for each word. A word embedding-based vector was then created for each sequence and fed into machine learning classification models. When extracting feature sets, we not only varied the segmentation size of the protein sequences but also tried different combinations of the resulting n-grams to find the features that gave the best prediction. Furthermore, our methodology follows Chou’s 5-step rules to build a reliable classification tool.

Results With our proposed hybrid features, the prediction models perform better than models built on seven prominent sequence-based feature types. Results from 10 independent runs on the surveyed dataset show that, on average, our optimal models obtain an area under the curve of 0.984 on 5-fold cross-validation and 0.998 on the independent test.

Conclusions These results show that biologists can use our model to efficiently identify tumor necrosis factors among other cytokines. Moreover, this study shows that natural language processing techniques can reasonably be applied to help biologists solve bioinformatics problems efficiently.
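The feature construction described in the Background can be illustrated with a minimal Python sketch. It is not the TNFPred implementation. The overlapping n-gram segmentation, the averaging of word vectors, the concatenation across gram sizes, and the function names are all assumptions made for illustration. The `toy_embedding` function is a hypothetical stand-in for embeddings that would in practice be learned from protein sequences.

```python
import random
from typing import List, Sequence


def split_into_words(sequence: str, n: int) -> List[str]:
    """Split a protein sequence into overlapping n-gram "words" (assumed scheme)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def toy_embedding(word: str, dim: int) -> List[float]:
    """Hypothetical stand-in for a trained embedding.

    Returns a deterministic pseudo-random vector seeded by the word itself,
    so the same word always maps to the same vector.
    """
    rng = random.Random(word)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def sequence_vector(sequence: str, n: int, dim: int) -> List[float]:
    """Average the embeddings of all n-gram words in one sequence (assumed pooling)."""
    words = split_into_words(sequence, n)
    if not words:  # sequence shorter than n: fall back to a zero vector
        return [0.0] * dim
    totals = [0.0] * dim
    for word in words:
        for k, value in enumerate(toy_embedding(word, dim)):
            totals[k] += value
    return [t / len(words) for t in totals]


def hybrid_features(sequence: str, gram_sizes: Sequence[int], dim: int) -> List[float]:
    """Concatenate the per-gram-size sequence vectors into one hybrid feature vector."""
    features: List[float] = []
    for n in gram_sizes:
        features.extend(sequence_vector(sequence, n, dim))
    return features


if __name__ == "__main__":
    example = "MSTESMIRDV"  # short illustrative amino acid string, not real data
    vector = hybrid_features(example, gram_sizes=(1, 2, 3), dim=4)
    print(len(vector))  # one 4-dimensional block per gram size
```

The resulting fixed-length vector could then be passed to any standard classifier. The choice of gram sizes to combine is exactly what the abstract describes tuning.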
