Prediction of human protein subcellular localization using deep learning

Abstract Protein subcellular localization (PSL), as one of the most critical characteristics of human cells, plays an important role for understanding specific functions and biological processes in cells. Accurate prediction of protein subcellular localization is a fundamental and challenging problem, for which machine learning algorithms have been widely used. Traditionally, the performance of PSL prediction highly depends on handcrafted feature descriptors to represent proteins. In recent years, deep learning has emerged as a hot research topic in the field of machine learning, achieving outstanding success in learning high-level latent features within data samples. In this paper, to accurately predict protein subcellular locations, we propose a deep learning based predictor called DeepPSL by using Stacked Auto-Encoder (SAE) networks. In this predictor, we automatically learn high-level and abstract feature representations of proteins by exploring non-linear relations among diverse subcellular locations, addressing the problem of the need of handcrafted feature representations. Experimental results evaluated with three-fold cross validation show that the proposed DeepPSL outperforms traditional machine learning based methods. It is expected that DeepPSL, as the first predictor in the field of PSL prediction, has great potential to be a powerful computational method complementary to existing tools.

[1]  P. Aloy,et al.  Relation between amino acid composition and cellular location of proteins. , 1997, Journal of molecular biology.

[2]  Sun-Yuan Kung,et al.  HybridGO-Loc: Mining Hybrid Features on Gene Ontology for Predicting Subcellular Localization of Multi-Location Proteins , 2014, PloS one.

[3]  Leyi Wei,et al.  A novel hierarchical selective ensemble classifier with bioinformatics application , 2017, Artif. Intell. Medicine.

[4]  Zhaowei Shang,et al.  Multi-Label Learning With Fuzzy Hypergraph Regularization for Protein Subcellular Location Prediction , 2014, IEEE Transactions on NanoBioscience.

[5]  Gene Ontology Consortium The Gene Ontology (GO) database and informatics resource , 2003 .

[6]  Jijun Tang,et al.  Local-DPP: An improved DNA-binding protein prediction method by exploring local evolutionary information , 2017, Inf. Sci..

[7]  Q. Zou,et al.  Recent Progress in Machine Learning-Based Methods for Protein Fold Recognition , 2016, International journal of molecular sciences.

[8]  B. Liu,et al.  PseDNA‐Pro: DNA‐Binding Protein Identification by Combining Chou’s PseAAC and Physicochemical Distance Transformation , 2015, Molecular informatics.

[9]  Gaotao Shi,et al.  CPPred-RF: A Sequence-based Predictor for Identifying Cell-Penetrating Peptides and Their Uptake Efficiency. , 2017, Journal of proteome research.

[10]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001 .

[11]  Søren Brunak,et al.  A Neural Network Method for Identification of Prokaryotic and Eukaryotic Signal Peptides and Prediction of their Cleavage Sites , 1997, Int. J. Neural Syst..

[12]  Md. Al Mehedi Hasan,et al.  Feature Fusion Based SVM Classifier for Protein Subcellular Localization Prediction , 2016, J. Integr. Bioinform..

[13]  Jijun Tang,et al.  PhosPred-RF: A Novel Sequence-Based Predictor for Phosphorylation Sites Using Sequential Information Only , 2017, IEEE Transactions on NanoBioscience.

[14]  Dinggang Shen,et al.  Deep Learning-Based Feature Representation for AD/MCI Classification , 2013, MICCAI.

[15]  Zhiyong Chen,et al.  Exploring local discriminative information from evolutionary profiles for cytokine-receptor interaction prediction , 2016, Neurocomputing.

[16]  Naif Alajlan,et al.  Deep learning approach for active classification of electrocardiogram signals , 2016, Inf. Sci..

[17]  Junjie Chen,et al.  Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences , 2015, Nucleic Acids Res..

[18]  Xiangxiang Zeng,et al.  nDNA-prot: identification of DNA-binding proteins based on unbalanced classification , 2014, BMC Bioinformatics.

[19]  Ying Gao,et al.  Bioinformatics Applications Note Sequence Analysis Cd-hit Suite: a Web Server for Clustering and Comparing Biological Sequences , 2022 .

[20]  W. Ansorge Next-generation DNA sequencing techniques. , 2009, New biotechnology.

[21]  S. Brunak,et al.  Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. , 2000, Journal of molecular biology.

[22]  Ran Su,et al.  Identifying N6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine , 2017, Scientific Reports.

[23]  Taeho Jo,et al.  Improving Protein Fold Recognition by Deep Learning Networks , 2015, Scientific Reports.

[24]  Zhiyong Lu,et al.  Predicting subcellular localization of proteins using machine-learned classifiers , 2004, Bioinform..

[25]  Fei Guo,et al.  Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier , 2017, Artif. Intell. Medicine.

[26]  B Marshall,et al.  Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource , 2004, Nucleic Acids Res..

[27]  Kuo-Chen Chou,et al.  A top-down approach to enhance the power of predicting human protein subcellular localization: Hum-mPLoc 2.0. , 2009, Analytical biochemistry.

[28]  P Bork,et al.  Wanted: subcellular localization of proteins based on sequence. , 1998, Trends in cell biology.

[29]  Sun-Yuan Kung,et al.  PairProSVM: Protein Subcellular Localization Based on Local Pairwise Profile Alignment and SVM , 2008, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[30]  Q. Zou,et al.  Hierarchical Classification of Protein Folds Using a Novel Ensemble Classifier , 2013, PloS one.

[31]  M. Marra,et al.  Applications of next-generation sequencing technologies in functional genomics. , 2008, Genomics.

[32]  Xiangrong Liu,et al.  Implementation of Arithmetic Operations With Time-Free Spiking Neural P Systems , 2015, IEEE Transactions on NanoBioscience.

[33]  B. Frey,et al.  Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning , 2015, Nature Biotechnology.

[34]  R. Ji,et al.  Improved and Promising Identification of Human MicroRNAs by Incorporating a High-Quality Negative Set , 2014, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[35]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[36]  Michel Schneider,et al.  UniProtKB/Swiss-Prot. , 2007, Methods in molecular biology.

[37]  K. Chou,et al.  iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. , 2012, Molecular bioSystems.

[38]  Tao Jiang,et al.  TITER: predicting translation initiation sites by deep learning , 2017, bioRxiv.

[39]  Sridhar Krishnan,et al.  Combining least-squares support vector machines for classification of biomedical signals: a case study with knee-joint vibroarthrographic signals , 2011, J. Exp. Theor. Artif. Intell..

[40]  Steve Renals,et al.  Convolutional Neural Networks for Distant Speech Recognition , 2014, IEEE Signal Processing Letters.

[41]  Gaotao Shi,et al.  Fast Prediction of Protein Methylation Sites Using a Sequence-Based Feature Selection Technique , 2019, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[42]  Ming Wen,et al.  Deep-Learning-Based Drug-Target Interaction Prediction. , 2017, Journal of proteome research.

[43]  B. Liu,et al.  An Approach for Identifying Cytokines Based on a Novel Ensemble Classifier , 2013, BioMed research international.

[44]  Minoru Kanehisa,et al.  Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs , 2003, Bioinform..