Efficient cross-trait penalized regression increases prediction accuracy in large cohorts using secondary phenotypes

We introduce cross-trait penalized regression (CTPR), a powerful and practical approach for multi-trait polygenic risk prediction in large cohorts. Specifically, we propose a novel cross-trait penalty function, combined with the Lasso and the minimax concave penalty (MCP), to incorporate genetic effects shared across multiple traits in large-sample GWAS data. Our approach extracts information from the secondary traits that is beneficial for predicting the primary trait, based on individual-level genotypes and/or summary statistics. Our implementation of a parallel computing algorithm makes it feasible to apply the method to biobank-scale GWAS data. We illustrate our method using large-scale GWAS data (~1M SNPs) from the UK Biobank (N = 456,837). We show that our multi-trait method outperforms the recently proposed multi-trait analysis of GWAS (MTAG) in predictive performance. With UK Biobank data, the prediction accuracy for height with the aid of BMI improves from R² = 35.8% (MTAG) to 42.5% (MCP + CTPR) or 42.8% (Lasso + CTPR).

Information about the genetic architectures of complex traits can be leveraged for predicting phenotypes. Here, the authors develop CTPR (Cross-Trait Penalized Regression), a method for multi-trait polygenic risk prediction using individual-level genotypes and/or summary statistics from large cohorts.
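To make the idea of combining a sparsity penalty with a cross-trait penalty concrete, the following is a minimal sketch, not the authors' implementation. The Lasso and MCP penalties follow their standard definitions. The cross-trait term shown here, a quadratic penalty on differences between a SNP's effects across trait pairs, is an assumption chosen for illustration, since the abstract does not give the exact form. All function names, the unweighted pairwise sum, and the default `gamma = 3.0` are hypothetical.

```python
import numpy as np


def lasso_penalty(beta, lam):
    """Standard Lasso penalty, applied elementwise: lam * |beta|."""
    return lam * np.abs(beta)


def mcp_penalty(beta, lam, gamma=3.0):
    """Standard minimax concave penalty (MCP), applied elementwise.

    For |beta| <= gamma * lam: lam * |beta| - beta^2 / (2 * gamma)
    Otherwise:                 gamma * lam^2 / 2  (constant, so large effects
                               are not shrunk further)
    """
    a = np.abs(beta)
    return np.where(a <= gamma * lam,
                    lam * a - a ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)


def cross_trait_penalty(B, lam2):
    """ASSUMED illustrative cross-trait term (not taken from the source).

    B has shape (n_snps, n_traits). The penalty sums squared differences
    between each SNP's effects over all trait pairs, pulling effects toward
    each other when traits share genetic architecture.
    """
    n_traits = B.shape[1]
    total = 0.0
    for k in range(n_traits):
        for l in range(k + 1, n_traits):
            total += np.sum((B[:, k] - B[:, l]) ** 2)
    return 0.5 * lam2 * total


def penalized_objective(X, Y, B, lam1, lam2, sparsity="mcp", gamma=3.0):
    """Least-squares loss plus sparsity and cross-trait penalties.

    X: (n_samples, n_snps) genotypes; Y: (n_samples, n_traits) phenotypes;
    B: (n_snps, n_traits) effect sizes.
    """
    n = X.shape[0]
    loss = 0.5 * np.sum((Y - X @ B) ** 2) / n
    if sparsity == "mcp":
        sparse_term = mcp_penalty(B, lam1, gamma).sum()
    else:
        sparse_term = lasso_penalty(B, lam1).sum()
    return loss + sparse_term + cross_trait_penalty(B, lam2)
```

In this sketch, setting `lam2 = 0` reduces the objective to independent single-trait penalized regressions, which is one way to see what the cross-trait term contributes.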
