DISSCO: direct imputation of summary statistics allowing covariates

BACKGROUND Imputation of individual-level genotypes at untyped markers using an external reference panel of genotyped or sequenced individuals has become standard practice in genetic association studies. Direct imputation of summary statistics can also be valuable, for example in meta-analyses where individual-level genotype data are not available. Two methods (DIST and ImpG-Summary/LD) that assume a multivariate Gaussian distribution for the association summary statistics have been proposed for imputing those statistics. However, both methods assume that the correlations between association summary statistics are the same as the correlations between the corresponding genotypes. This assumption can be violated in the presence of confounding covariates.

METHODS We show analytically that, in the absence of covariates, the correlation among association summary statistics is indeed the same as that among the corresponding genotypes, which provides a theoretical justification for the recently proposed methods. We further prove that, in the presence of covariates, the correlation among association summary statistics becomes the partial correlation of the corresponding genotypes controlling for the covariates. We therefore develop direct imputation of summary statistics allowing covariates (DISSCO).

RESULTS We consider two real-life scenarios in which the difference between correlation and partial correlation is likely to matter in practice: (i) association studies in admixed populations; (ii) association studies in the presence of other confounding covariate(s). Application of DISSCO to real datasets under both scenarios shows performance that is at least comparable to, if not better than, that of existing correlation-based methods, particularly for lower-frequency variants. For example, DISSCO can reduce the absolute deviation from the truth by 3.9-15.2% for variants with minor allele frequency <5%.
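
The idea in the abstract can be illustrated with a minimal sketch; it is not the DISSCO implementation. It assumes a reference genotype matrix `G` (samples by markers) and an optional covariate matrix `C`, computes the partial correlation of genotypes by residualizing on the covariates, and then imputes untyped z-scores as the conditional mean under a multivariate Gaussian. The function names, the `ridge` regularization term, and its default value are hypothetical choices made for illustration.

```python
import numpy as np

def partial_correlation(G, C=None):
    """Correlation of genotype columns after regressing out covariates.

    G: (n_samples, n_markers) genotype dosages from a reference panel.
    C: optional (n_samples, n_covariates) matrix; an intercept is always included.
    With C=None this reduces to the ordinary genotype correlation.
    """
    n = G.shape[0]
    X = np.ones((n, 1)) if C is None else np.column_stack([np.ones(n), C])
    # Least-squares fit of every genotype column on intercept + covariates.
    beta, *_ = np.linalg.lstsq(X, G, rcond=None)
    resid = G - X @ beta
    return np.corrcoef(resid, rowvar=False)

def impute_z(R, typed, untyped, z_typed, ridge=1e-3):
    """Conditional mean of untyped z-scores given typed z-scores.

    R: (partial) correlation matrix over all markers.
    typed / untyped: lists of marker indices into R.
    ridge: assumed small diagonal term to stabilize the solve (illustrative).
    """
    R_tt = R[np.ix_(typed, typed)] + ridge * np.eye(len(typed))
    R_ut = R[np.ix_(untyped, typed)]
    return R_ut @ np.linalg.solve(R_tt, np.asarray(z_typed, dtype=float))
```

Calling `partial_correlation(G)` without covariates gives the matrix that a correlation-based approach would use, while passing covariates such as ancestry proportions gives the partial-correlation matrix that the abstract argues is appropriate when the association analysis adjusted for those covariates.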
