Glycoconjugates constitute a major class of biomolecules which include glycoproteins, glycosphingolipids
and proteoglycans. The enzymatic process in which glycans (sugar chains) are linked to
proteins or lipids is called glycosylation. Glycosylation is involved in many biological processes, both
physiological and pathological, including host-pathogen interactions, tumour invasion, cell trafficking
and signalling. Changes in glycan structure are thought to be at least partly responsible for the development
of inflammation, infection, arteriosclerosis, immune defects and autoimmunity. Such changes
have been observed in human diseases such as diabetes mellitus, rheumatoid arthritis and Alzheimer’s
disease. Aberrant patterns of glycosylation are also a universal feature of cancer cells. The field of
glycobiology thus shows great potential for the discovery of glycan biomarkers for disease diagnosis
and prognosis.
Here we focus specifically on N-glycans, that is, glycans attached to protein molecules via a
nitrogen atom. This class of glycans is the best characterized. High-throughput hydrophilic interaction
liquid chromatography (HILIC) is a well-established technique for the separation and quantification of
N-linked glycans released from glycoproteins. HILIC analysis quantifies the N-glycan structures in serum
via a chromatogram, which is subsequently standardized and integrated. The data generated for each sample
are a set of relative HILIC peak areas, which sum to a constant, and so the data are compositional. To date,
most statistical analyses of these glycan data fail to account for their compositional nature.
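As a minimal illustration of what "compositional" means here, the following Python sketch rescales a set of peak areas to relative areas and applies the additive log-ratio (alr) transform, which maps a composition to unconstrained real values on which a multivariate normal can be placed (the idea underlying the Logistic Normal model). The peak-area values and function names are hypothetical and chosen only for illustration; this is not the processing pipeline used for the HILIC data.

```python
import math

def closure(peak_areas):
    """Rescale positive peak areas so that they sum to 1 (a composition)."""
    total = sum(peak_areas)
    return [a / total for a in peak_areas]

def alr(composition):
    """Additive log-ratio transform, using the last part as the reference."""
    ref = composition[-1]
    return [math.log(p / ref) for p in composition[:-1]]

def alr_inverse(y):
    """Map alr coordinates back to a composition that sums to 1."""
    expd = [math.exp(v) for v in y] + [1.0]
    total = sum(expd)
    return [e / total for e in expd]

# Hypothetical integrated peak areas for one sample (three peaks).
areas = [120.0, 30.0, 50.0]
comp = closure(areas)      # relative peak areas
y = alr(comp)              # two unconstrained coordinates
recovered = alr_inverse(y)
```

Because the relative areas are constrained to sum to one, a composition with D parts carries only D - 1 free coordinates, which is why `y` has one fewer element than `comp`.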
We compare and contrast three compositional data models for the glycan HILIC data: the Dirichlet,
Nested Dirichlet and Logistic Normal models, with the intention of providing tools for the statistical
analysis of compositional data in the glycobiology field. We use these three models for
classification of disease/control cases in ovarian and lung cancer diagnosis applications. We discuss
and compare these models in terms of their classification performance and goodness-of-fit.
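To make the classification idea concrete, the sketch below evaluates the Dirichlet log-density of a composition under one parameter vector per class and assigns the sample to the class with the higher value. It assumes equal prior class probabilities and uses made-up parameter values in place of fitted ones; the names `class_alphas` and `classify` are hypothetical, and this is one simple way such a rule could be set up rather than the procedure used in the study.

```python
import math

def dirichlet_logpdf(x, alpha):
    """Log-density of a Dirichlet(alpha) distribution at composition x."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))

def classify(x, class_alphas):
    """Return the class label whose Dirichlet gives x the highest log-density.

    Assumes equal prior probabilities for all classes.
    """
    return max(class_alphas, key=lambda label: dirichlet_logpdf(x, class_alphas[label]))

# Hypothetical parameters; in practice these would be estimated from training data.
class_alphas = {
    "control": [6.0, 2.0, 2.0],
    "case": [2.0, 2.0, 6.0],
}

sample = [0.62, 0.18, 0.20]   # hypothetical relative peak areas
label = classify(sample, class_alphas)
```

The same rule extends to the other two models by swapping in their log-densities.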