Use of Chou’s 5-steps rule to predict the subcellular localization of gram-negative and gram-positive bacterial proteins by multi-label learning based on gene ontology annotation and profile alignment

To date, many proteins generated by large-scale genome sequencing projects remain uncharacterized and are under intensive investigation by both experimental and computational means. Knowledge of protein subcellular localization (SCL) is of key importance for elucidating protein function. However, predicting it remains a challenging task, especially for multi-site proteins, which shuttle between cell compartments to perform their biological functions, and for proteins without significant homology to proteins of known subcellular location. Owing to their low cost and reasonable accuracy, machine learning-based methods have gained much attention in this context, helped by the availability of a plethora of biological databases and annotated proteins for analysis and benchmarking. Various predictive models have been proposed to tackle the SCL problem using different protein sequence features related to subcellular localization; however, the overwhelming majority of them focus on single localization and cover very few cellular locations. Prediction has mainly relied on sorting signals, amino acid composition, and homology. To improve prediction quality, the focus is now on knowledge extracted from annotation databases, such as protein-protein interactions and Gene Ontology (GO) functional domain annotation, which has recently become widely adopted and essential information for learning systems. To address this problem, in the present study we treated SCL prediction as a multi-label learning problem and sought to label both single-site and multi-site unannotated bacterial protein sequences by mining protein homology relationships using both the GO terms of protein homologs and PSI-BLAST profiles.
Experiments using 5-fold cross-validation on the benchmark datasets showed that the proposed consensus multi-label prediction model, which discriminates six compartments for Gram-negative and five compartments for Gram-positive bacterial proteins, yields a significant improvement in results.
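To make the multi-label framing concrete, the following minimal Python sketch shows one common way to encode multi-site proteins as 0/1 label vectors and one simple way to form a consensus from two score sources. It is an illustration under stated assumptions, not the model described in this study: the location names, the functions `to_indicator` and `consensus`, the score averaging, and the 0.5 threshold are all hypothetical choices made for the example.

```python
from typing import Dict, List, Set

# Hypothetical compartment names, for illustration only.
LOCATIONS: List[str] = ["cytoplasm", "inner_membrane", "periplasm"]


def to_indicator(label_sets: List[Set[str]], locations: List[str]) -> List[List[int]]:
    """Encode each protein's set of locations as a 0/1 vector, one column per location."""
    return [[1 if loc in labels else 0 for loc in locations] for labels in label_sets]


def consensus(
    go_scores: Dict[str, float],
    profile_scores: Dict[str, float],
    threshold: float = 0.5,
) -> Set[str]:
    """Average per-location scores from two sources and keep locations at or above the threshold.

    Assumption: both dictionaries hold scores in [0, 1] for the same locations.
    If no location passes the threshold, fall back to the single best-scoring one,
    so that every protein receives at least one label.
    """
    avg = {loc: (go_scores[loc] + profile_scores[loc]) / 2.0 for loc in go_scores}
    chosen = {loc for loc, score in avg.items() if score >= threshold}
    if not chosen:
        chosen = {max(avg, key=avg.get)}
    return chosen
```

In this sketch a protein annotated with two compartments simply has two 1s in its indicator row, which is what lets single-site and multi-site proteins be handled by the same learner.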
