Incorporating a transfer learning technique with amino acid embeddings to efficiently predict N-linked glycosylation sites in ion channels

Glycosylation is a dynamic enzymatic process that attaches glycans to proteins or other organic molecules such as lipoproteins. Research has shown that glycosylation of ion channel proteins plays a fundamental role in modulating ion channel function. This study used a computational method to predict sites of N-linked glycosylation, the most common type of glycosylation, in ion channel proteins. For each segment of an ion channel protein centered on an N-linked glycosylation site, the amino acid embedding vectors of the residues were concatenated to form the feature vector used for prediction. We experimented with two models for converting amino acids to their embeddings: one trained on ion channel sequences only, and the other trained on a large dataset of more than one million protein sequences. The latter model, an application of transfer learning, proved to be the more efficient feature extractor. Our best model was obtained by combining this transfer learning approach with hyperparameter tuning by random search, evaluated with 5-fold cross-validation. It achieved an accuracy, specificity, sensitivity, and Matthews correlation coefficient of 93.4%, 92.8%, 98.6%, and 0.726, respectively. The corresponding scores on an independent test set were 92.9%, 92.2%, 99%, and 0.717. These results surpass those obtained with position-specific scoring matrix features, which are predominantly employed in post-translational modification site prediction. Furthermore, compared with N-GlyDE, GlycoEP, and SPRINT-Gly, the most recent N-linked glycosylation site predictors, our model yields higher scores on all four metrics, further demonstrating the efficiency of our approach.
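The feature construction and evaluation metrics described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the embedding table is a random stand-in for a trained embedding model, and the window half-width (`flank`), the padding symbol `X`, the choice of every asparagine as a candidate site, and all function names are assumptions made for illustration only.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def make_toy_embeddings(dim=4, seed=0):
    """Hypothetical stand-in for a trained embedding model:
    one random vector per residue, plus a zero vector for padding."""
    rng = random.Random(seed)
    table = {aa: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for aa in AMINO_ACIDS}
    table["X"] = [0.0] * dim  # assumed padding symbol for positions off the sequence
    return table

def candidate_sites(sequence):
    """Return 0-based positions of asparagine (N) residues (assumed candidates)."""
    return [i for i, aa in enumerate(sequence) if aa == "N"]

def window_features(sequence, center, flank, embeddings):
    """Concatenate the embeddings of all residues in the window
    [center - flank, center + flank], padding beyond the sequence ends."""
    features = []
    for pos in range(center - flank, center + flank + 1):
        residue = sequence[pos] if 0 <= pos < len(sequence) else "X"
        features.extend(embeddings.get(residue, embeddings["X"]))
    return features

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and Matthews correlation
    coefficient computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}

if __name__ == "__main__":
    emb = make_toy_embeddings(dim=4)
    seq = "MNGTANK"  # toy sequence, not real data
    for site in candidate_sites(seq):
        vec = window_features(seq, site, flank=2, embeddings=emb)
        print(site, len(vec))
    print(binary_metrics(tp=50, tn=40, fp=10, fn=0))
```

With a window of `2 * flank + 1` residues and embeddings of dimension `dim`, each site yields a feature vector of length `(2 * flank + 1) * dim`, which could then be passed to any standard classifier.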
