ProfilePSTMM: capturing tree-structure motifs in carbohydrate sugar chains

MOTIVATION: Carbohydrate sugar chains, or glycans, are considered the third major class of biomolecules after DNA and proteins. They are branched structures of monosaccharides that grow from a single root monosaccharide. They are vital to the development and functioning of multicellular organisms because various proteins recognize them in order to perform specific functions. Our motivation is to study this recognition mechanism by applying informatics techniques to the available data. Previously, we introduced the probabilistic sibling-dependent tree Markov model (PSTMM), which we showed could be trained efficiently on sibling-dependent tree structures and could return the most likely state paths. However, the extra dependency between siblings caused overfitting problems, and retrieving patterns from the trained model required extracting them manually from the most likely state paths. We therefore introduce profilePSTMM, which avoids these problems by incorporating a novel concept: different types of state transitions that handle parent-child and sibling dependencies differently.

RESULTS: Our new algorithms are more efficient and extract patterns more easily. We tested profilePSTMM on both synthetic (controlled) data and glycan data from the KEGG GLYCAN database. Additionally, we tested it on glycans that are known to be recognized and bound by proteins at various binding affinities, and we show that our results correlate with results published in the literature.
