NonClasGP-Pred: robust and efficient prediction of non-classically secreted proteins by integrating subset-specific optimal models of imbalanced data

Non-classically secreted proteins (NCSPs) are proteins located in the extracellular environment despite lacking known signal peptides or secretion motifs. They usually perform different biological functions in the intracellular and extracellular environments, and several of these functions are linked to bacterial virulence and cell defence. Accurate protein localization is essential for all living organisms. However, the performance of existing methods for NCSP identification has been unsatisfactory; in particular, they suffer from data deficiency and possible overfitting. Further improvement is desirable, especially in addressing the lack of informative features and in mining subset-specific features from imbalanced datasets. In the present study, a new computational predictor was developed for NCSP prediction in Gram-positive bacteria. First, to address the possible prediction bias caused by data imbalance, ten balanced subdatasets were generated for ensemble model construction. Second, the F-score algorithm combined with sequential forward search was used to strengthen the feature representation ability for each training subdataset. Third, a subset-specific optimal feature combination process was adopted to characterize the original data from different aspects, and all subdataset-based models were integrated into a unified model, NonClasGP-Pred. In ten-fold cross-validation, the model achieved an accuracy of 93.23 %, a sensitivity of 100 %, a specificity of 89.01 %, a Matthews correlation coefficient of 87.68 % and an area under the curve of 0.9975. On the independent test dataset, the proposed model outperformed state-of-the-art available toolkits. For availability and implementation, see: http://lab.malab.cn/~wangchao/softwares/NonClasGP/.
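The three steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the balanced subdatasets were drawn or which classifier was used, so the sketch assumes random undersampling of the larger class, uses a nearest-centroid rule as a stand-in scorer, and combines per-subset predictions by simple majority vote. All function names and the toy data are hypothetical.

```python
import random
from statistics import mean, variance


def balanced_subsets(pos, neg, n_subsets=10, seed=0):
    """Assumed scheme: pair the smaller class with equal-size random draws from the larger."""
    rng = random.Random(seed)
    small, large = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    small_label = 1 if small is pos else 0
    subsets = []
    for _ in range(n_subsets):
        drawn = rng.sample(large, len(small))
        X = list(small) + drawn
        y = [small_label] * len(small) + [1 - small_label] * len(drawn)
        subsets.append((X, y))
    return subsets


def f_score(X, y, j):
    """F-score of feature j: between-class separation over within-class variance."""
    pos = [x[j] for x, label in zip(X, y) if label == 1]
    neg = [x[j] for x, label in zip(X, y) if label == 0]
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)
    return between / within if within > 0 else 0.0


def rank_features(X, y):
    """Feature indices sorted from highest to lowest F-score."""
    return sorted(range(len(X[0])), key=lambda j: f_score(X, y, j), reverse=True)


def forward_search(X, y, ranked, evaluate):
    """Add features in rank order and keep the shortest best-scoring prefix."""
    best_k, best_score = 1, float("-inf")
    for k in range(1, len(ranked) + 1):
        score = evaluate(X, y, ranked[:k])
        if score > best_score:
            best_k, best_score = k, score
    return ranked[:best_k]


def centroid_accuracy(X, y, features):
    """Stand-in scorer: training accuracy of a nearest-centroid rule (an assumption)."""
    def centroid(label):
        rows = [x for x, l in zip(X, y) if l == label]
        return [mean(r[j] for r in rows) for j in features]

    c1, c0 = centroid(1), centroid(0)

    def predict(x):
        d1 = sum((x[j] - c) ** 2 for j, c in zip(features, c1))
        d0 = sum((x[j] - c) ** 2 for j, c in zip(features, c0))
        return 1 if d1 < d0 else 0

    return mean(1.0 if predict(x) == l else 0.0 for x, l in zip(X, y))


def majority_vote(votes):
    """Combine 0/1 predictions from the per-subset models."""
    return 1 if sum(votes) * 2 > len(votes) else 0


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
pos = [[1.0, 0.3], [1.2, 0.7], [0.9, 0.5], [1.1, 0.4]]
neg = [[0.05 * i, 0.3 + 0.1 * (i % 5)] for i in range(12)]

subsets = balanced_subsets(pos, neg)
selected = [
    forward_search(X, y, rank_features(X, y), centroid_accuracy) for X, y in subsets
]
```

Each entry of `selected` is the feature subset chosen for one balanced subdataset, which is the sense in which the feature combinations are subset-specific. In practice the scorer would be a cross-validated estimate from a real classifier rather than training accuracy.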
