Glycosylation site prediction using ensembles of Support Vector Machine classifiers

Background

Glycosylation is one of the most complex post-translational modifications (PTMs) of proteins in eukaryotic cells. Glycosylation plays an important role in biological processes ranging from protein folding and subcellular localization, to ligand recognition and cell-cell interactions. Experimental identification of glycosylation sites is expensive and laborious. Hence, there is significant interest in the development of computational methods for reliable prediction of glycosylation sites from amino acid sequences.

Results

We explore machine learning methods for training classifiers to predict the amino acid residues that are likely to be glycosylated using information derived from the target amino acid residue and its sequence neighbors. We compare the performance of Support Vector Machine classifiers and ensembles of Support Vector Machine classifiers trained on a dataset of experimentally determined N-linked, O-linked, and C-linked glycosylation sites extracted from O-GlycBase version 6.00, a database of 242 proteins from several different species. The results of our experiments show that the ensembles of Support Vector Machine classifiers outperform single Support Vector Machine classifiers on the problem of predicting glycosylation sites in terms of a range of standard measures for comparing the performance of classifiers. The resulting methods have been implemented in EnsembleGly, a web server for glycosylation site prediction.

Conclusion

Ensembles of Support Vector Machine classifiers offer an accurate and reliable approach to automated identification of putative glycosylation sites in glycoprotein sequences.
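As a rough illustration of predicting from "the target amino acid residue and its sequence neighbors" and of combining several classifiers into an ensemble, the following Python sketch extracts fixed-width windows around candidate residues and takes a majority vote over a list of binary classifiers. Everything specific here is an assumption made for illustration and is not taken from the abstract: the function names, the padding symbol, the window width, the choice of S/T as candidate residues, and the toy stand-in classifiers, which are simple rules rather than trained Support Vector Machines.

```python
from typing import Callable, List, Sequence, Tuple

PAD = "-"  # hypothetical padding symbol for positions beyond the sequence ends


def extract_windows(sequence: str, targets: str = "ST",
                    flank: int = 3) -> List[Tuple[int, str]]:
    """Return (position, window) pairs for every candidate residue.

    Each window holds the candidate residue plus `flank` neighbors on
    each side, padded with PAD where it runs past the sequence ends.
    The defaults (S/T candidates, flank of 3) are illustrative only.
    """
    padded = PAD * flank + sequence + PAD * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # In `padded`, position i of the original sequence sits at
            # index i + flank, so this slice is centered on the residue.
            windows.append((i, padded[i:i + 2 * flank + 1]))
    return windows


def majority_vote(classifiers: Sequence[Callable[[str], int]],
                  window: str) -> int:
    """Return 1 if more than half of the classifiers predict 1, else 0."""
    votes = [clf(window) for clf in classifiers]
    return int(2 * sum(votes) > len(votes))


# Toy stand-ins for trained ensemble members (NOT real SVMs): each maps a
# window to 0 (not glycosylated) or 1 (glycosylated) by an arbitrary rule.
toy_ensemble = [
    lambda w: 1,              # always predicts a site
    lambda w: 0,              # never predicts a site
    lambda w: int("P" in w),  # predicts a site if a proline is nearby
]

if __name__ == "__main__":
    for position, window in extract_windows("MAPSTK", flank=2):
        print(position, window, majority_vote(toy_ensemble, window))
```

In a real pipeline each window would be encoded numerically and each ensemble member would be a trained classifier; the voting step would stay the same in shape.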
