Application of Compositional Models for Glycan HILIC Data

Glycoconjugates constitute a major class of biomolecules that includes glycoproteins, glycosphingolipids and proteoglycans. The enzymatic process in which glycans (sugar chains) are linked to proteins or lipids is called glycosylation. Glycosylation is involved in many biological processes, both physiological and pathological, including host-pathogen interactions, tumour invasion, cell trafficking and signalling. Changes in glycan structure are thought to be at least partly responsible for the development of inflammation, infection, arteriosclerosis, immune defects and autoimmunity. Such changes have been observed in human diseases such as diabetes mellitus, rheumatoid arthritis and Alzheimer's disease. Aberrant patterns of glycosylation are also a universal feature of cancer cells. The field of glycobiology thus shows great potential for the discovery of glycan biomarkers for disease diagnosis and prognosis. Here we focus specifically on N-glycans, that is, glycans attached to protein molecules via a nitrogen atom. This class of glycans is the best characterized. High-throughput hydrophilic interaction liquid chromatography (HILIC) is a well-established technique for the separation and quantification of N-linked glycans released from glycoproteins. HILIC analysis quantifies the N-glycan structures in serum via a chromatogram, which is subsequently standardized and integrated. The data generated for each sample are a set of relative HILIC peak areas and, as a result, the data are compositional. To date, most statistical analyses of these glycan data fail to account for their compositional nature. We compare and contrast three compositional data models for the glycan HILIC data: the Dirichlet, Nested Dirichlet and Logistic Normal models, with the intention of providing tools for compositional data analysis in the glycobiology field. We use these three models to classify disease and control cases in ovarian and lung cancer diagnosis applications. We discuss and compare these models in terms of their classification performance and goodness-of-fit.
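To make the compositional structure concrete, the following minimal Python sketch shows how a set of integrated peak areas becomes a composition, and how two of the models named above relate to it. It is an illustration under stated assumptions, not the analysis pipeline used in this work: the peak areas are hypothetical, and the function names (`closure`, `alr`, `alr_inverse`, `dirichlet_logpdf`) are introduced here for illustration only. The Dirichlet density is evaluated directly on the composition, whereas the logistic normal model assumes that the additive log-ratio transform of the composition is multivariate normal.

```python
import math

def closure(peak_areas):
    """Scale positive peak areas so they sum to 1 (a composition)."""
    total = sum(peak_areas)
    return [a / total for a in peak_areas]

def alr(composition):
    """Additive log-ratio transform, using the last part as the reference."""
    ref = composition[-1]
    return [math.log(p / ref) for p in composition[:-1]]

def alr_inverse(y):
    """Map log-ratios back to a composition on the simplex."""
    exps = [math.exp(v) for v in y] + [1.0]
    total = sum(exps)
    return [e / total for e in exps]

def dirichlet_logpdf(composition, alpha):
    """Log density of a Dirichlet(alpha) distribution at a composition."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum(
        (a - 1.0) * math.log(p) for a, p in zip(alpha, composition)
    )

# Hypothetical integrated peak areas for one sample (not real data).
areas = [120.0, 340.0, 80.0, 60.0]
x = closure(areas)   # relative peak areas, summing to 1
y = alr(x)           # unconstrained coordinates for a logistic normal model
```

Because the parts of `x` must sum to one, they cannot vary independently, which is why methods that treat each relative peak area as an unconstrained variable can mislead. The transform `alr` removes the sum constraint, and `alr_inverse` recovers the original composition.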
