Risk prediction using genome‐wide association studies

Over the last few years, many new genetic associations have been identified by genome‐wide association studies (GWAS). These variants have many potential uses: a better understanding of disease etiology, personalized medicine, new leads for studying underlying biology, and risk prediction. Recently, there has been some skepticism regarding the prospects of risk prediction using GWAS, primarily because the individual effect sizes of variants associated with the phenotype are mostly small. However, it has also been argued that many disease‐associated variants have not yet been identified; hence, prospects for risk prediction may improve if more variants are included. From a risk prediction perspective, it is reasonable to average a larger number of predictors, of which some may have (limited) predictive power and some may actually be noise. The idea is that, when added together, the combined small signals result in a signal that is stronger than the noise from the unrelated predictors. We examine various aspects of the construction of models for the estimation of disease probability. We compare different methods for constructing such models, examine how the implementation of cross‐validation may influence results, and examine which single nucleotide polymorphisms (SNPs) are most useful for prediction. We carry out our investigation on GWAS of the Wellcome Trust Case Control Consortium. For Crohn's disease, we confirm our results on another GWAS. Our results suggest that utilizing a larger number of SNPs than those which reach genome‐wide significance, for example using the lasso, improves the construction of risk prediction models. Genet. Epidemiol. 34: 643‐652, 2010. © 2010 Wiley‐Liss, Inc.
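The argument that many weak predictors can jointly outperform a few strong ones can be illustrated with a small simulation. The sketch below is not the method of the paper: it uses simulated genotypes and a simple additive score weighted by marginal case-control differences, not the lasso or the consortium data. All function names, sample sizes, effect sizes, and allele frequencies are assumptions chosen for illustration. Weights and SNP selection are derived from the training set only, so the test-set AUC is not inflated by selection.

```python
import numpy as np


def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    case = scores[labels == 1][:, None]
    ctrl = scores[labels == 0][None, :]
    return float(np.mean(case > ctrl) + 0.5 * np.mean(case == ctrl))


def simulate(n, beta, maf, rng):
    """Draw additive genotypes (0/1/2) and case status from a logistic model."""
    g = rng.binomial(2, maf, size=(n, beta.size)).astype(float)
    eta = g @ beta
    eta -= 2 * maf * beta.sum()  # centre the linear predictor: roughly half are cases
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    return g, y


def marginal_weights(g, y):
    """Per-SNP difference in mean allele count, cases minus controls."""
    return g[y == 1].mean(axis=0) - g[y == 0].mean(axis=0)


def compare(seed=0, n=4000, n_causal=100, n_noise=400,
            effect=0.15, maf=0.3, top_k=5):
    """Return test-set AUC for a top_k-SNP score and for an all-SNP score."""
    rng = np.random.default_rng(seed)
    # Hypothetical architecture: many small true effects plus pure-noise SNPs.
    beta = np.concatenate([np.full(n_causal, effect), np.zeros(n_noise)])
    g_train, y_train = simulate(n, beta, maf, rng)
    g_test, y_test = simulate(n, beta, maf, rng)

    w = marginal_weights(g_train, y_train)   # estimated on training data only
    top = np.argsort(-np.abs(w))[:top_k]     # selection also uses training data only

    auc_few = auc(g_test[:, top] @ w[top], y_test)
    auc_many = auc(g_test @ w, y_test)
    return auc_few, auc_many


if __name__ == "__main__":
    few, many = compare()
    print(f"AUC, top SNPs only: {few:.3f}")
    print(f"AUC, all SNPs:      {many:.3f}")
```

Under these assumed settings the all-SNP score should discriminate better than the score built from the few strongest SNPs, even though most of the included SNPs are noise. A penalized regression such as the lasso would replace the marginal weights with jointly estimated, shrunken coefficients.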
