Prediction of N-linked glycosylation sites using position relative features and statistical moments

Glycosylation is one of the most complex post-translational modifications in eukaryotic cells. Almost 50% of the human proteome is glycosylated, and glycosylation plays a vital role in various biological functions such as antigen recognition, cell-cell communication, gene expression and protein folding. Identifying glycosylation sites in protein sequences is a significant challenge because experimental methods are time-consuming and expensive, so a reliable computational method is desirable. In this study, a comprehensive machine learning technique for the identification of N-linked glycosylation sites is proposed. The proposed predictor, a multilayer neural network, was trained on an up-to-date dataset using the backpropagation algorithm. The results of ten-fold cross-validation and performance measures such as accuracy, sensitivity, specificity and Matthews correlation coefficient indicate that the proposed tool is considerably more accurate than existing systems such as GlycoMine, GlycoEP, Ensemble SVM and GPP.
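The four performance measures named above are all derived from the counts of a binary confusion matrix. The following is a minimal sketch of their standard definitions, not the authors' evaluation code; the function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity and MCC from confusion-matrix counts.

    tp/fn: glycosylated sites predicted correctly/incorrectly.
    tn/fp: non-glycosylated sites predicted correctly/incorrectly.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Sensitivity: fraction of true sites that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-sites that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when any marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Illustrative counts only (not results from the study).
example = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(example)
```

Unlike accuracy, the Matthews correlation coefficient uses all four counts in both numerator and denominator, which makes it more informative when glycosylated and non-glycosylated sites are unevenly represented.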
