Variable Selection, Sparse Meta-Analysis and Genetic Risk Prediction for Genome-Wide Association Studies

QIANCHUAN HE: Variable Selection, Sparse Meta-Analysis and Genetic Risk Prediction for Genome-Wide Association Studies (Under the direction of Dr. Danyu Lin and Dr. Hao Helen Zhang)

Genome-wide association studies (GWAS) usually involve more than half a million single nucleotide polymorphisms (SNPs). The common practice of analyzing one SNP at a time does not fully realize the potential of GWAS to identify multiple causal variants and to predict disease risk. Recently developed variable selection methods allow joint analysis of GWAS data, but they tend to miss causal SNPs that are marginally uncorrelated with disease, and they have high false discovery rates (FDRs). Genetic risk prediction becomes highly challenging when the number of causal variants is large and many of the effects are weak. Existing methods mostly rely on marginal regression estimates, and their predictive power is quite limited. In meta-analysis, the involvement of multiple studies adds another layer of complexity to variable selection. Although existing variable selection methods can potentially be applied to meta-analysis, they require direct access to raw data, which are often difficult to obtain.

In the first part of this dissertation, we introduce GWASelect, a statistically powerful and computationally efficient variable selection method for analyzing GWAS data. This method searches iteratively over the candidate SNPs, conditional on the SNPs already selected, and is thus capable of capturing causal SNPs that are marginally correlated with disease as well as those that are marginally uncorrelated with disease. A special resampling mechanism is built into the method to reduce false-positive findings.
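The two ideas named above, an iterative search conditional on previously selected SNPs and a resampling step to limit false positives, can be illustrated with a small sketch. This is not the GWASelect implementation, whose details the abstract does not give. It is a simplified stand-in that assumes a continuous trait, ordinary least squares, ranking by correlation with the current residual, half-sample resampling, and a selection-frequency threshold. All function names, parameters, and default values are hypothetical.

```python
import numpy as np


def iterative_conditional_screen(X, y, n_iter=3, k=2):
    """Greedy screen (illustrative assumption, not GWASelect itself).

    Each round ranks the unselected columns of X by their absolute
    correlation-like score with the residual of y after a least-squares
    fit on the columns chosen so far, then adds the top k columns.
    """
    n, _ = X.shape
    Xc = X - X.mean(axis=0)
    norms = np.linalg.norm(Xc, axis=0)
    norms[norms == 0] = np.inf  # constant columns get a score of 0
    selected = []
    resid = y - y.mean()
    for _ in range(n_iter):
        scores = np.abs(Xc.T @ resid) / norms
        if selected:
            scores[selected] = -np.inf  # never pick a column twice
        top = np.argsort(scores)[::-1][:k]
        selected.extend(int(j) for j in top)
        # Refit on everything selected so far; the next round is
        # therefore conditional on the current selection.
        A = np.column_stack([np.ones(n), X[:, selected]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
    return selected


def resampled_selection(X, y, n_resamples=20, threshold=0.6, seed=0, **kw):
    """Repeat the screen on random half-samples and keep the columns
    selected in at least `threshold` of the resamples (hypothetical rule)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        for j in iterative_conditional_screen(X[idx], y[idx], **kw):
            counts[j] += 1
    freq = counts / n_resamples
    return np.flatnonzero(freq >= threshold), freq
```

Calling `resampled_selection(X, y, n_iter=2, k=1)` on an `n x p` genotype matrix coded 0/1/2 returns the indices of the retained columns together with each column's selection frequency. Raising `threshold` trades sensitivity for fewer false positives.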
