Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution

Background

Glycoproteins are involved in a diverse range of biochemical and biological processes. Changes in protein glycosylation are believed to occur in many diseases, particularly during cancer initiation and progression. The identification of biomarkers for human disease states is becoming increasingly important, as early detection is key to improving survival and recovery rates. To this end, the serum glycome has been proposed as a potential source of biomarkers for different types of cancers.

High-throughput hydrophilic interaction liquid chromatography (HILIC) technology for glycan analysis allows for the detailed quantification of the glycan content in human serum. However, the experimental data from this analysis is compositional by nature. Compositional data are subject to a constant-sum constraint, which restricts the sample space to a simplex. Statistical analysis of glycan chromatography datasets should account for their unusual mathematical properties.

As the volume of glycan HILIC data being produced increases, there is a considerable need for a framework to support appropriate statistical analysis. Proposed here is a methodology for feature selection in compositional data. The principal objective is to provide a template for the analysis of glycan chromatography data that may be used to identify potential glycan biomarkers.

Results

A greedy search algorithm, based on the generalized Dirichlet distribution, is carried out over the feature space to search for the set of “grouping variables” that best discriminate between known group structures in the data, modelling the compositional variables using beta distributions. The algorithm is applied to two glycan chromatography datasets. Statistical classification methods are used to test the ability of the selected features to differentiate between known groups in the data. Two well-known methods are used for comparison: correlation-based feature selection (CFS) and recursive partitioning (rpart). CFS is a feature selection method, while recursive partitioning is a learning tree algorithm that has been used for feature selection in the past.

Conclusions

The proposed feature selection method performs well for both glycan chromatography datasets. It is computationally slower, but results in a lower misclassification rate and a higher sensitivity rate than both correlation-based feature selection and the classification tree method.
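
To make the idea of a greedy search built on the generalized Dirichlet distribution more concrete, the following Python sketch illustrates one way such a search could be organised. It is not the authors' implementation, and several details are assumptions made for illustration only: each candidate variable is rescaled by the mass not yet taken up by the selected variables (the stick-breaking construction under which generalized Dirichlet components become independent beta variables); beta parameters are fitted by the method of moments rather than maximum likelihood; and a BIC difference between a per-group beta model and a pooled beta model is used as the selection score. The function names `fit_beta_mom`, `beta_loglik`, `bic_difference`, `greedy_select` and `make_demo_data`, and the simulated data, are hypothetical.

```python
import math
import random


def fit_beta_mom(z):
    """Method-of-moments beta fit (an assumption; MLE could be used instead)."""
    n = len(z)
    m = sum(z) / n
    v = max(sum((t - m) ** 2 for t in z) / n, 1e-12)  # population variance
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def beta_loglik(z, a, b):
    """Log-likelihood of values z in (0, 1) under Beta(a, b)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return sum((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_norm
               for t in z)


def bic_difference(z, labels):
    """BIC (2*loglik - k*log n) of a per-group beta model minus a pooled one."""
    n = len(z)
    groups = sorted(set(labels))
    pooled = beta_loglik(z, *fit_beta_mom(z))
    grouped = 0.0
    for g in groups:
        zg = [t for t, lab in zip(z, labels) if lab == g]
        grouped += beta_loglik(zg, *fit_beta_mom(zg))
    bic_grouped = 2 * grouped - 2 * len(groups) * math.log(n)
    bic_pooled = 2 * pooled - 2 * math.log(n)
    return bic_grouped - bic_pooled


def greedy_select(X, labels):
    """Forward greedy search: add the variable with the largest positive score."""
    p = len(X[0])
    selected = []
    used = [0.0] * len(X)  # per-sample mass of already selected components
    while len(selected) < p - 1:
        best_j, best_score = None, 0.0
        for j in range(p):
            if j in selected:
                continue
            # Rescale by the remaining mass (stick-breaking transformation).
            z = [row[j] / (1 - u) for row, u in zip(X, used)]
            if not all(0 < t < 1 for t in z):
                continue
            score = bic_difference(z, labels)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:  # no candidate favours the grouping model
            break
        selected.append(best_j)
        used = [u + row[best_j] for row, u in zip(X, used)]
    return selected


def make_demo_data(seed=0, n_per_group=50):
    """Simulated 4-part compositions; only component 0 differs between groups."""
    rng = random.Random(seed)
    X, labels = [], []
    for label, shape0 in (("A", 2.0), ("B", 20.0)):
        for _ in range(n_per_group):
            g = [rng.gammavariate(shape0, 1.0)]
            g += [rng.gammavariate(5.0, 1.0) for _ in range(3)]
            total = sum(g)
            X.append([t / total for t in g])  # closure to the unit simplex
            labels.append(label)
    return X, labels


X, labels = make_demo_data()
selected = greedy_select(X, labels)
print("Selected grouping variables (column indices):", selected)
```

In this simulated example the constant-sum constraint makes every raw component differ between the groups, even though only the first was generated differently. Rescaling by the remaining mass after each selection is what allows the search, in principle, to separate genuine grouping variables from differences induced purely by closure.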
