The application of network label propagation to rank biomarkers in genome-wide Alzheimer’s data

Background: Ranking and identifying biomarkers that are associated with disease from genome-wide measurements holds significant promise for understanding the genetic basis of common diseases. The large number of single nucleotide polymorphisms (SNPs) in genome-wide association studies (GWAS), however, makes this task computationally challenging when the ranking is to be done in a multivariate fashion. This paper evaluates the performance of a multivariate graph-based method called label propagation (LP) that efficiently ranks SNPs in genome-wide data.

Results: The performance of LP was evaluated on a synthetic dataset and two late-onset Alzheimer’s disease (LOAD) genome-wide datasets, and was compared to that of three control methods. The control methods were chi-squared, a commonly used univariate method, and two multivariate ranking methods: a Relief method called SWRF and sparse logistic regression (SLR). Performance was measured by evaluating the top-ranked SNPs in terms of classification performance, reproducibility between the two datasets, and prior evidence of association with LOAD.

On the synthetic data, LP performed comparably to the control methods. On the GWAS data, LP performed significantly better than chi-squared and SWRF in classification performance over the range from 10 to 1000 top-ranked SNPs for both datasets, and was not significantly different from SLR. LP also had greater ranking reproducibility than chi-squared, SWRF, and SLR. Among the 25 top-ranked SNPs identified by LP, 14 SNPs in one dataset and 10 SNPs in the other had evidence in the literature of being associated with LOAD, which was higher than for the other methods.

Conclusion: LP performed considerably better in ranking SNPs in two high-dimensional genome-wide datasets when compared to three control methods. It had better performance in the evaluation measures we used, and is computationally efficient enough to be applied in practice to data from genome-wide studies. These results provide support for including LP among the methods that are used to rank SNPs in genome-wide datasets.
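To make the idea of graph-based SNP ranking concrete, the following is a minimal sketch of label propagation on a sample-by-SNP bipartite graph. It is an illustration under stated assumptions and not the implementation evaluated in the paper: the abstract does not specify the graph construction, so the bipartite layout, the update rule F ← αSF + (1 − α)Y with symmetric degree normalization, the value of `alpha`, and the function names `label_propagation_scores` and `rank_features` are all hypothetical choices made for this example.

```python
import numpy as np


def label_propagation_scores(X, y, alpha=0.5, n_iter=200, tol=1e-9):
    """Propagate case/control labels from samples to features.

    X : (n_samples, n_features) non-negative matrix (e.g. 0/1 allele indicators).
    y : (n_samples,) array with 1 for cases and 0 for controls.
    Returns one propagated score per feature.
    Assumption: a bipartite sample-feature graph with edge weights X.
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    n = n_samples + n_features

    # Symmetric adjacency of the bipartite graph (samples first, then features).
    W = np.zeros((n, n))
    W[:n_samples, n_samples:] = X
    W[n_samples:, :n_samples] = X.T

    # Symmetric degree normalization: S = D^{-1/2} W D^{-1/2}.
    deg = W.sum(axis=1)
    deg[deg == 0] = 1.0  # isolated nodes keep a zero row/column in S
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Initial labels: +1 for cases, -1 for controls, 0 for feature nodes.
    Y = np.zeros(n)
    Y[:n_samples] = np.where(np.asarray(y) == 1, 1.0, -1.0)

    # Iterate F <- alpha * S F + (1 - alpha) * Y until the update is small.
    F = Y.copy()
    for _ in range(n_iter):
        F_next = alpha * (S @ F) + (1 - alpha) * Y
        converged = np.max(np.abs(F_next - F)) < tol
        F = F_next
        if converged:
            break

    return F[n_samples:]


def rank_features(X, y, **kwargs):
    """Return feature indices ordered by decreasing absolute propagated score."""
    scores = label_propagation_scores(X, y, **kwargs)
    return np.argsort(-np.abs(scores), kind="stable")


# Toy data: feature 0 occurs only in cases, feature 1 only in controls,
# and feature 2 in every sample (uninformative).
X_toy = np.array([[1, 0, 1]] * 4 + [[0, 1, 1]] * 4)
y_toy = np.array([1] * 4 + [0] * 4)
toy_scores = label_propagation_scores(X_toy, y_toy)
toy_ranking = rank_features(X_toy, y_toy)
```

In this toy example the two features tied to a single class receive scores of opposite sign, while the feature present in every sample receives a score near zero and is ranked last. The dense `(n_samples + n_features)`-square matrix is used only for readability; genome-wide data would call for a sparse representation.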
