DISTMIX: direct imputation of summary statistics for unmeasured SNPs from mixed ethnicity cohorts

Motivation: To increase the signal resolution for large-scale meta-analyses of genome-wide association studies, genotypes at unmeasured single nucleotide polymorphisms (SNPs) are commonly imputed using large multi-ethnic reference panels. However, the ever increasing size and ethnic diversity of both reference panels and cohorts makes genotype imputation computationally challenging for moderately sized computer clusters. Moreover, genotype imputation requires subject-level genetic data, which unlike summary statistics provided by virtually all studies, is not publicly available. While there are much less demanding methods which avoid the genotype imputation step by directly imputing SNP statistics, e.g. Directly Imputing summary STatistics (DIST) proposed by our group, their implicit assumptions make them applicable only to ethnically homogeneous cohorts. Results: To decrease computational and access requirements for the analysis of cosmopolitan cohorts, we propose DISTMIX, which extends DIST capabilities to the analysis of mixed ethnicity cohorts. The method uses a relevant reference panel to directly impute unmeasured SNP statistics based only on statistics at measured SNPs and estimated/user-specified ethnic proportions. Simulations show that the proposed method adequately controls the Type I error rates. The 1000 Genomes panel imputation of summary statistics from the ethnically diverse Psychiatric Genetic Consortium Schizophrenia Phase 2 suggests that, when compared to genotype imputation methods, DISTMIX offers comparable imputation accuracy for only a fraction of computational resources. Availability and implementation: DISTMIX software, its reference population data, and usage examples are publicly available at http://code.google.com/p/distmix. Contact: dlee4@vcu.edu Supplementary information: Supplementary Data are available at Bioinformatics online.

[1]  G. Abecasis,et al.  MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes , 2010, Genetic epidemiology.

[2]  Eran Halperin,et al.  A generic coalescent‐based framework for the selection of a reference panel for imputation , 2010, Genetic epidemiology.

[3]  N. Wray,et al.  A mega-analysis of genome-wide association studies for major depressive disorder , 2013, Molecular Psychiatry.

[4]  O. Delaneau,et al.  A linear complexity phasing method for thousands of genomes , 2011, Nature Methods.

[5]  P. Donnelly,et al.  A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies , 2009, PLoS genetics.

[6]  Ross M. Fraser,et al.  A General Approach for Haplotype Phasing across the Full Spectrum of Relatedness , 2014, PLoS genetics.

[7]  Donghyung Lee,et al.  JEPEG: a summary statistics based tool for gene-level joint testing of functional variants , 2014, Bioinform..

[8]  W. G. Hill,et al.  The Limits of Individual Identification from Sample Allele Frequencies: Theory and Statistical Analysis , 2009, PLoS genetics.

[9]  Joseph K. Pickrell Joint analysis of functional genomic data and genome-wide association studies of 18 human traits , 2013, bioRxiv.

[10]  J. Marchini,et al.  Genotype Imputation with Thousands of Genomes , 2011, G3: Genes | Genomes | Genetics.

[11]  Olivier Delaneau,et al.  Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel , 2014, Nature Communications.

[12]  S. Nelson,et al.  Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays , 2008, PLoS genetics.

[13]  Wei Wang,et al.  MaCH‐Admix: Genotype Imputation for Admixed Populations , 2013, Genetic epidemiology.

[14]  M. Stephens,et al.  Imputation-Based Analysis of Association Studies: Candidate Regions and Quantitative Traits , 2007, PLoS genetics.

[15]  D. Altshuler,et al.  A map of human genome variation from population-scale sequencing , 2010, Nature.

[16]  Donghyung Lee,et al.  DIST: direct imputation of summary statistics for unmeasured SNPs , 2013, Bioinform..

[17]  Disorder Working Group Large-scale genome-wide association analysis of bipolar disorder identifies a new susceptibility locus near ODZ4 , 2012, Nature Genetics.

[18]  Gaurav Bhatia,et al.  Fast and accurate imputation of summary statistics enhances evidence of functional enrichment , 2013, Bioinform..

[19]  Life Technologies,et al.  A map of human genome variation from population-scale sequencing , 2011 .

[20]  Dan L Nicolae,et al.  Testing Untyped Alleles (TUNA)—applications to genome‐wide association studies , 2006, Genetic epidemiology.

[21]  Ayellet V. Segrè,et al.  Hundreds of variants clustered in genomic loci and biological pathways affect human height , 2010, Nature.

[22]  B. Browning,et al.  Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. , 2007, American journal of human genetics.

[23]  C. Spencer,et al.  Biological Insights From 108 Schizophrenia-Associated Genetic Loci , 2014, Nature.