A collective ranking method for genome-wide association studies

Genome-wide association studies (GWAS) analyze genetic variation (SNPs) across the entire human genome, searching for SNPs that are associated with certain phenotypes, most often diseases, such as breast cancer. In GWAS, we seek a ranking of SNPs in terms of their relevance to the given phenotype. However, because certain SNPs are known to be highly correlated with one another across individuals, it can be beneficial to take into account these correlations when ranking. If a SNP appears associated with the phenotype, and we question whether this association is real, the extent to which its neighbors (correlated SNPs) also appear associated can be informative. Therefore, we propose CollectRank, a ranking approach which allows SNPs to reinforce one another via the correlation structure. CollectRank is loosely analogous to the well-known PageRank algorithm. We first evaluate CollectRank on synthetic data generated from a variety of genetic models under different settings. The numerical results suggest CollectRank can significantly outperform common GWAS methods at the cost of a small amount of extra computation. We further evaluate CollectRank on two real-world GWAS on breast cancer and atrial fibrillation/flutter, and CollectRank performs well in both studies. We finally provide a theoretical analysis that also suggests CollectRank's advantages.

[1]  Joseph L. Gastwirth,et al.  Trend Tests for Case-Control Studies of Genetic Markers: Power, Sample Size and Robustness , 2002, Human Heredity.

[2]  Peter Donnelly,et al.  HAPGEN2: simulation of multiple disease SNPs , 2011, Bioinform..

[3]  Eran Halperin,et al.  Leveraging the HapMap correlation structure in association studies. , 2007, American journal of human genetics.

[4]  Trevor J. Hastie,et al.  Genome-wide association analysis by lasso penalized logistic regression , 2009, Bioinform..

[5]  S. Gabriel,et al.  Efficiency and power in genetic association studies , 2005, Nature Genetics.

[6]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[7]  R. Wilkins Polygenes, risk prediction, and targeted prevention of breast cancer. , 2008, The New England journal of medicine.

[8]  Geoffrey J McLachlan,et al.  Selection bias in gene extraction on the basis of microarray gene-expression data , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[9]  David Page,et al.  Predicting cancer susceptibility from single-nucleotide polymorphism data: a case study in multiple myeloma , 2005, BIOKDD.

[10]  D. Schaid,et al.  Case-Control Studies of Genetic Markers: Power and Sample Size Approximations for Armitage’s Test for Trend , 2001, Human Heredity.

[11]  C. McCarty,et al.  Marshfield Clinic Personalized Medicine Research Project (PMRP): design, methods and recruitment for a large population-based biobank. , 2005, Personalized medicine.

[12]  Larry A. Rendell,et al.  A Practical Approach to Feature Selection , 1992, ML.

[13]  J. Kaprio,et al.  Environmental and heritable factors in the causation of cancer--analyses of cohorts of twins from Sweden, Denmark, and Finland. , 2000, The New England journal of medicine.

[14]  W. G. Cochran Some Methods for Strengthening the Common χ 2 Tests , 1954 .

[15]  F. A. Seiler,et al.  Numerical Recipes in C: The Art of Scientific Computing , 1989 .

[16]  Wendy A. Wolf,et al.  The eMERGE Network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies , 2011, BMC Medical Genomics.

[17]  Eleazar Eskin,et al.  Increasing Power in Association Studies by Using Linkage Disequilibrium Structure and Molecular Function as Prior Information , 2008, RECOMB.

[18]  E. Eskin,et al.  Efficient Association Study Design Via Power‐Optimized Tag SNP Selection , 2008, Annals of human genetics.

[19]  P. Tam The International HapMap Consortium. The International HapMap Project (Co-PI of Hong Kong Centre which responsible for 2.5% of genome) , 2003 .

[20]  Huan Liu,et al.  Efficient Feature Selection via Analysis of Relevance and Redundancy , 2004, J. Mach. Learn. Res..

[21]  J. Pritchard,et al.  Linkage disequilibrium in humans: models and data. , 2001, American journal of human genetics.

[22]  P. Armitage Tests for Linear Trends in Proportions and Frequencies , 1955 .

[23]  Joseph T. Glessner,et al.  From Disease Association to Risk Assessment: An Optimistic View from Genome-Wide Association Studies on Type 1 Diabetes , 2009, PLoS genetics.

[24]  W. Willett,et al.  A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer , 2007, Nature Genetics.

[25]  Deanne M. Taylor,et al.  Powerful SNP-set analysis for case-control genome-wide association studies. , 2010, American journal of human genetics.

[26]  Lester L. Peters,et al.  Genome-wide association study identifies novel breast cancer susceptibility loci , 2007, Nature.

[27]  C. Carlson,et al.  Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. , 2004, American journal of human genetics.

[28]  Eran Halperin,et al.  Maximizing power in association studies , 2009, Nature Biotechnology.