Predicting subcellular location of protein with evolution information and sequence-based deep learning

Background: Protein subcellular localization prediction plays an important role in biological research. Because traditional methods are laborious and time-consuming, many machine learning-based prediction methods have been proposed. However, most of these methods ignore the evolutionary information of proteins. To improve prediction accuracy, we present a deep learning-based method for predicting protein subcellular locations.

Results: Our method uses not only the amino acid sequences of proteins but also their evolution matrices. A bidirectional long short-term memory network processes the entire protein sequence, and a convolutional neural network extracts features from it. The position-specific scoring matrix supplements the protein sequence as an additional input. We trained and tested the method on two benchmark datasets. On these datasets it achieves an average precision of 0.7901, a ranking loss of 0.0758, and a coverage of 1.2848.

Conclusion: The experimental results show that our method outperforms five currently available methods. These experiments indicate that our method is an acceptable alternative for predicting protein subcellular location.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04404-0.
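The three figures reported in the Results (average precision, ranking loss, coverage) are multi-label ranking metrics. The sketch below shows one common way to compute them in plain Python. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, every protein is assumed to have at least one true and one false location label, ties are ranked pessimistically, and coverage is counted as the number of top-ranked labels needed to include all true labels (some definitions subtract 1 from this value, and the abstract does not say which convention was used).

```python
from typing import List, Sequence


def _ranks(scores: Sequence[float]) -> List[int]:
    """Rank of each label (1 = highest score); tied labels share the worst rank."""
    return [sum(1 for s in scores if s >= scores[j]) for j in range(len(scores))]


def average_precision(y_true, y_score) -> float:
    """Mean, over samples, of the precision measured at each true label's rank."""
    total = 0.0
    for labels, scores in zip(y_true, y_score):
        ranks = _ranks(scores)
        relevant = [j for j, y in enumerate(labels) if y == 1]
        # For each true label: share of labels ranked at or above it that are true.
        per_label = [
            sum(1 for k in relevant if ranks[k] <= ranks[j]) / ranks[j]
            for j in relevant
        ]
        total += sum(per_label) / len(relevant)
    return total / len(y_true)


def ranking_loss(y_true, y_score) -> float:
    """Mean fraction of (true, false) label pairs that are ordered incorrectly."""
    total = 0.0
    for labels, scores in zip(y_true, y_score):
        relevant = [j for j, y in enumerate(labels) if y == 1]
        irrelevant = [j for j, y in enumerate(labels) if y == 0]
        wrong = sum(
            1 for j in relevant for k in irrelevant if scores[k] >= scores[j]
        )
        total += wrong / (len(relevant) * len(irrelevant))
    return total / len(y_true)


def coverage(y_true, y_score) -> float:
    """Mean number of top-ranked labels needed to include every true label."""
    total = 0
    for labels, scores in zip(y_true, y_score):
        ranks = _ranks(scores)
        total += max(ranks[j] for j, y in enumerate(labels) if y == 1)
    return total / len(y_true)


if __name__ == "__main__":
    # Toy data: two proteins, three candidate locations (made up for illustration).
    y_true = [[1, 0, 0], [1, 1, 0]]
    y_score = [[0.9, 0.2, 0.1], [0.8, 0.3, 0.6]]
    print(average_precision(y_true, y_score))
    print(ranking_loss(y_true, y_score))
    print(coverage(y_true, y_score))
```

Higher average precision is better, while lower ranking loss and lower coverage are better, which is the sense in which the reported values should be read.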
