A Machine Learning Based Approach to de novo Sequencing of Glycans from Tandem Mass Spectrometry Spectrum

Glycomics has been studied actively in recent years, and technologies for it have developed rapidly. Tandem mass spectrometry (MS/MS) is currently one of the key experimental tools for identifying the structures of oligosaccharides. MS/MS spectra contain peaks of fragmented glycan ions, including cross-ring ions resulting from internal cleavages, which provide valuable information for inferring glycan structures. The aim of de novo sequencing of glycans is therefore to find the most probable assignment of observed MS/MS peaks to glycan substructures without relying on databases. However, few satisfactory algorithms exist for glycan de novo sequencing from MS/MS spectra. We present a machine learning based approach to de novo sequencing of glycans from MS/MS spectra. First, we build a suitable model for the fragmentation of glycans, including cross-ring ions, and implement a solver that employs Lagrangian relaxation with a dynamic programming technique. Then, to optimize the scores used by the algorithm, we introduce a machine learning technique called the structured support vector machine, which enables us to learn parameters, including scores for cross-ring ions, from training data, i.e., known glycan mass spectra. Furthermore, we implement additional constraints for the core structures of well-known glycan types, including N-linked and O-linked glycans. This enables more accurate prediction of glycan structures when the glycan type of the given spectrum is known. Computational experiments show that our algorithm performs accurate de novo sequencing of glycans. The implementation of our algorithm and the datasets are available at http://glyfon.dna.bio.keio.ac.jp/.
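To make the parameter-learning step more concrete, the following is a minimal sketch of a structured SVM subgradient update on a toy problem. It is not the authors' implementation: the feature names (`glycosidic`, `cross_ring`, `unexplained`), the explicit candidate list, the loss values, and the learning rate are all assumptions chosen for illustration, and the regularizer is omitted. In the approach described above, the search over candidate structures would be carried out by the Lagrangian relaxation and dynamic programming solver rather than by enumerating a short list.

```python
from typing import Dict, List

Features = Dict[str, float]


def dot(w: Features, f: Features) -> float:
    """Linear score of a candidate: inner product of weights and features."""
    return sum(w.get(k, 0.0) * v for k, v in f.items())


def predict(w: Features, candidates: List[Features]) -> int:
    """Return the index of the highest-scoring candidate structure."""
    return max(range(len(candidates)), key=lambda i: dot(w, candidates[i]))


def ssvm_step(w: Features, candidates: List[Features], gold: int,
              loss: List[float], lr: float = 0.1) -> Features:
    """One subgradient step on the structured hinge loss (no regularizer)."""
    # Loss-augmented inference: find the most violating candidate.
    viol = max(range(len(candidates)),
               key=lambda i: dot(w, candidates[i]) + loss[i])
    margin = dot(w, candidates[viol]) + loss[viol] - dot(w, candidates[gold])
    if margin > 0:
        # Move the weights toward the gold features, away from the violator.
        for k, v in candidates[gold].items():
            w[k] = w.get(k, 0.0) + lr * v
        for k, v in candidates[viol].items():
            w[k] = w.get(k, 0.0) - lr * v
    return w


def train(candidates: List[Features], gold: int, loss: List[float],
          epochs: int = 20) -> Features:
    """Repeat the update on a single toy training example."""
    w: Features = {}
    for _ in range(epochs):
        w = ssvm_step(w, candidates, gold, loss)
    return w


# Hypothetical feature counts: peaks explained by glycosidic fragments,
# peaks explained by cross-ring fragments, and peaks left unexplained.
CANDIDATES = [
    {"glycosidic": 4, "cross_ring": 2, "unexplained": 1},  # known structure
    {"glycosidic": 5, "cross_ring": 0, "unexplained": 2},
    {"glycosidic": 3, "cross_ring": 1, "unexplained": 3},
]
# Hypothetical structural loss of each candidate against the known structure.
LOSS = [0.0, 1.0, 2.0]

weights = train(CANDIDATES, gold=0, loss=LOSS)
best = predict(weights, CANDIDATES)
```

In this toy setting the learned weights reward peaks explained by cross-ring fragments and penalize unexplained peaks, which loosely mirrors the idea of learning cross-ring ion scores from known spectra.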
