Evaluation and application of summary statistic imputation to discover new height-associated loci

Abstract As most of the heritability of complex traits is attributed to common and low-frequency genetic variants, imputing them by combining genotyping chips and large sequenced reference panels is the most cost-effective approach to discover the genetic basis of these traits. Association summary statistics from genome-wide meta-analyses are available for hundreds of traits. Updating these to ever-larger reference panels is cumbersome, as it requires re-imputing the genetic data, rerunning the association scan, and meta-analysing the results. A much more efficient method is to impute the summary statistics directly, termed summary statistics imputation. Its performance relative to genotype imputation and its practical utility have not yet been fully investigated. To this end, we compared the two approaches on real (genotyped and imputed) data from 120K samples from the UK Biobank and show that, while genotype imputation boasts a 2- to 5-fold lower root-mean-square error (RMSE), summary statistics imputation better distinguishes true associations from null ones. We observed the largest differences in power for variants with low minor allele frequency and low imputation quality. For fixed false positive rates of 0.001, 0.01 and 0.05, using summary statistics imputation yielded an increase in statistical power of 15, 10 and 3%, respectively. To test its capacity to discover novel associations, we applied summary statistics imputation to the GIANT height meta-analysis summary statistics covering HapMap variants, and identified 34 novel loci, 19 of which replicated using data in the UK Biobank. Additionally, we successfully replicated 55 out of the 111 variants published in an exome chip study. Our study demonstrates that summary statistics imputation is a very efficient and cost-effective way to identify and fine-map trait-associated loci. Moreover, the ability to impute summary statistics is important for follow-up analyses, such as Mendelian randomisation or LD-score regression.

Author summary Genome-wide association studies (GWASs) quantify the effect of genetic variants on traits, such as height. Such estimates are called association summary statistics and are typically shared publicly through publication. GWASs are usually carried out by genotyping ~500,000 SNVs for each individual, which are then combined with sequenced reference panels to infer untyped SNVs in each individual's genome. This process of genotype imputation is resource intensive and can therefore be a limitation when combining many GWASs. An alternative approach is to bypass the use of individual data and directly impute summary statistics. In our work we compare the performance of summary statistics imputation to genotype imputation. Although we observe a 2- to 5-fold lower root-mean-square error (RMSE) for genotype imputation compared to summary statistics imputation, summary statistics imputation better distinguishes true associations from null results. Furthermore, we demonstrate the potential of summary statistics imputation by presenting 34 novel height-associated loci, 19 of which were confirmed in UK Biobank. Our study demonstrates that, given current reference panels, summary statistics imputation is a very efficient and cost-effective way to identify common or low-frequency trait-associated loci.
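
The text above describes summary statistics imputation only in words. As a rough illustration, the sketch below shows the commonly used conditional-expectation form, in which the Z-score of an untyped variant is estimated from the Z-scores of nearby typed variants weighted by linkage disequilibrium (LD) correlations from a reference panel. It is a minimal sketch under stated assumptions, not necessarily the exact estimator used in this study: the function name `impute_z`, the ridge parameter `lam` and its default, and the quality measure are illustrative choices that do not come from the text.

```python
import numpy as np


def impute_z(z_typed, C, c, lam=0.1):
    """Estimate the Z-score of one untyped variant (illustrative sketch).

    z_typed : (k,) association Z-scores of the typed variants
    C       : (k, k) LD correlation matrix among the typed variants,
              estimated from a reference panel
    c       : (k,) LD correlations between the untyped variant and
              each typed variant, from the same reference panel
    lam     : ridge term added to the diagonal of C for numerical
              stability (hypothetical default, an assumption)
    """
    z_typed = np.asarray(z_typed, dtype=float)
    C = np.asarray(C, dtype=float)
    c = np.asarray(c, dtype=float)

    # Regularise the typed-variant LD matrix before solving.
    C_reg = C + lam * np.eye(C.shape[0])

    # Weights w solve C_reg w = c, so z_hat = c^T C_reg^{-1} z_typed.
    weights = np.linalg.solve(C_reg, c)
    z_hat = float(weights @ z_typed)

    # Rough quality measure c^T C_reg^{-1} c, clipped to [0, 1].
    quality = float(np.clip(c @ weights, 0.0, 1.0))
    return z_hat, quality


# One typed variant in LD (r = 0.8) with the untyped variant, no shrinkage.
z_hat, quality = impute_z([5.0], [[1.0]], [0.8], lam=0.0)
print(z_hat, quality)
```

In this toy case the imputed Z-score is simply the typed Z-score scaled by the LD correlation, which is why variants in weak LD with all typed variants receive estimates shrunk towards zero and a low quality value.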

[1]  Aaron F. McDaid,et al.  Improved imputation of summary statistics for admixed populations , 2018, bioRxiv.

[2]  P. Visscher,et al.  Quantifying the mapping precision of genome-wide association studies using whole-genome sequencing data , 2017, Genome Biology.

[3]  Marcelo P. Segura-Lepe,et al.  Rare and low-frequency coding variants alter human adult height , 2016, Nature.

[4]  A. Price,et al.  Dissecting the genetics of complex traits using summary association statistics , 2016, Nature Reviews Genetics.

[5]  James R. Staley,et al.  PhenoScanner: a database of human genotype–phenotype associations , 2016, Bioinform..

[6]  M. Pirinen,et al.  Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA , 2016, Nature Communications.

[7]  Judy H. Cho,et al.  Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations , 2015, Nature Genetics.

[8]  Jun S. Liu,et al.  The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans , 2015, Science.

[9]  P. Elliott,et al.  UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age , 2015, PLoS medicine.

[10]  Gonçalo R. Abecasis,et al.  Minimac2: Faster Genotype Imputation , 2015, Bioinform..

[11]  C. Morrison,et al.  Hormonal Contraception and the Risk of HIV Acquisition: An Individual Participant Data Meta-analysis , 2015, PLoS medicine.

[12]  Donghyung Lee,et al.  JEPEG: a summary statistics based tool for gene-level joint testing of functional variants , 2014, Bioinform..

[13]  P. Visscher,et al.  Another Explanation for Apparent Epistasis , 2014 .

[14]  Ross M. Fraser,et al.  Defining the role of common variation in the genomic and biological architecture of adult human height , 2014, Nature Genetics.

[15]  Joseph E. Powell,et al.  Detection and replication of epistasis influencing transcription in humans , 2014, Nature.

[16]  M. Daly,et al.  LD Score regression distinguishes confounding from polygenicity in genome-wide association studies , 2014, Nature Genetics.

[17]  Jun S. Liu,et al.  Genetics of rheumatoid arthritis contributes to biology and drug discovery , 2013 .

[18]  Joseph K. Pickrell Joint analysis of functional genomic data and genome-wide association studies of 18 human traits , 2013, bioRxiv.

[19]  Tanya M. Teslovich,et al.  Discovery and refinement of loci associated with lipid levels , 2013, Nature Genetics.

[20]  A. Butterworth,et al.  Mendelian Randomization Analysis With Multiple Genetic Variants Using Summarized Data , 2013, Genetic epidemiology.

[21]  Gaurav Bhatia,et al.  Fast and accurate imputation of summary statistics enhances evidence of functional enrichment , 2013, Bioinform..

[22]  Donghyung Lee,et al.  DIST: direct imputation of summary statistics for unmeasured SNPs , 2013, Bioinform..

[23]  Luigi Ferrucci,et al.  Imputation of Variants from the 1000 Genomes Project Modestly Improves Known Associations and Can Identify Low-frequency Variant - Phenotype Associations Undetected by HapMap Based Imputation , 2013, PloS one.

[24]  J. Philbeck,et al.  Hyper-Arousal Decreases Human Visual Thresholds , 2013, PLoS ONE.

[25]  Alireza Moayyeri,et al.  COHORT PROFILE Cohort Profile : TwinsUK and Healthy Ageing Twin Study , 2013 .

[26]  P. Suñé,et al.  Positive Outcomes Influence the Rate and Time to Publication, but Not the Impact Factor of Publications of Clinical Trial Results , 2013, PloS one.

[27]  Tanya M. Teslovich,et al.  Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes , 2012, Nature Genetics.

[28]  D. Lawlor,et al.  Cohort Profile: The ‘Children of the 90s’—the index offspring of the Avon Longitudinal Study of Parents and Children , 2012, International journal of epidemiology.

[29]  M. Lathrop,et al.  Genome-wide association study of HPV seropositivity. , 2011, Human molecular genetics.

[30]  J. Marchini,et al.  Genotype Imputation with Thousands of Genomes , 2011, G3: Genes | Genomes | Genetics.

[31]  Peter Donnelly,et al.  HAPGEN2: simulation of multiple disease SNPs , 2011, Bioinform..

[32]  G. Abecasis,et al.  MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes , 2010, Genetic epidemiology.

[33]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[34]  Matthew Stephens,et al.  USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA. , 2010, The annals of applied statistics.

[35]  Ayellet V. Segrè,et al.  Hundreds of variants clustered in genomic loci and biological pathways affect human height , 2010, Nature.

[36]  P. Visscher,et al.  Common SNPs explain a large proportion of heritability for human height , 2011 .

[37]  Montgomery Slatkin,et al.  Linkage disequilibrium — understanding the evolutionary past and mapping the medical future , 2008, Nature Reviews Genetics.

[38]  Eden R Martin,et al.  A multiple testing correction method for genetic association studies using correlated single nucleotide polymorphisms , 2008, Genetic epidemiology.

[39]  P. Donnelly,et al.  A new multipoint method for genome-wide association studies by imputation of genotypes , 2007, Nature Genetics.

[40]  K. Strimmer,et al.  Statistical Applications in Genetics and Molecular Biology A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics , 2011 .

[41]  M. L. Eaton Multivariate statistics : a vector space approach , 1985 .

[42]  H. Theil,et al.  Economic Forecasts and Policy. , 1959 .

[43]  O. Delaneau,et al.  UK Biobank Phasing and Imputation Documentation Version 1 . 2 13 November 2015 documentation author , 2015 .

[44]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .

[45]  Sven Bergmann,et al.  Methods for testing association between uncertain genotypes and quantitative traits. , 2011, Biostatistics.