Performance of Genotype Imputations Using Data from the 1000 Genomes Project

Genotype imputation based on 1000 Genomes (1KG) Project data has the advantage of imputing many more SNPs than imputation based on HapMap data. It also provides an opportunity to discover associations with relatively rare variants. Recent investigations are increasingly using 1KG data for genotype imputation, but only limited evaluations of the performance of this approach are available. In this paper, we empirically evaluated imputation performance using 1KG data by comparing the results with those obtained using the widely used HapMap Phase II data. We used three reference panels: two CEU panels of 120 haplotypes, one from HapMap II and one from 1KG data (June 2010 release), and the EUR panel of 566 haplotypes, also from 1KG data (August 2010 release). The input data were 324,607 autosomal Illumina SNPs genotyped in 501 individuals of European ancestry. Our most important finding was that both 1KG reference panels provided a much higher imputation yield than the HapMap II panel: more than twice as many SNPs were successfully imputed (6.7 million vs. 2.5 million). Our second most important finding was that accuracy using both 1KG panels was high and almost identical to accuracy using the HapMap II panel. Furthermore, after removing SNPs with MACH Rsq < 0.3, accuracy for both rare and low frequency SNPs was very high and almost identical to accuracy for common SNPs. We found that imputation using the 1KG-EUR panel had advantages in successfully imputing rare, low frequency and common variants. Our findings suggest that 1KG-based imputation can increase the opportunity to discover significant associations for SNPs across the allele frequency spectrum. Because the 1KG Project is still underway, we expect that later versions will provide even better imputation performance.
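The quality filter described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout, the function names `maf_bin` and `filter_and_bin`, and the example SNPs are hypothetical. The only value taken from the text is the Rsq cutoff of 0.3. The minor allele frequency (MAF) boundaries for rare, low frequency and common SNPs are not given in the abstract, so the 1% and 5% cutoffs used here are assumptions based on common convention.

```python
from collections import Counter

RSQ_THRESHOLD = 0.3       # from the text: SNPs with MACH Rsq < 0.3 are removed
RARE_MAX_MAF = 0.01       # assumption: the abstract does not state this boundary
LOW_FREQ_MAX_MAF = 0.05   # assumption: the abstract does not state this boundary


def maf_bin(maf):
    """Assign a SNP to a frequency class using the assumed MAF boundaries."""
    if maf < RARE_MAX_MAF:
        return "rare"
    if maf < LOW_FREQ_MAX_MAF:
        return "low_frequency"
    return "common"


def filter_and_bin(snps, rsq_threshold=RSQ_THRESHOLD):
    """Keep SNPs with Rsq >= threshold and count the survivors per MAF class."""
    kept = [s for s in snps if s["rsq"] >= rsq_threshold]
    counts = Counter(maf_bin(s["maf"]) for s in kept)
    return kept, counts


# Hypothetical imputed SNP records (illustrative values, not study data).
example_snps = [
    {"id": "snp1", "maf": 0.004, "rsq": 0.85},
    {"id": "snp2", "maf": 0.030, "rsq": 0.25},  # removed: Rsq below 0.3
    {"id": "snp3", "maf": 0.020, "rsq": 0.60},
    {"id": "snp4", "maf": 0.300, "rsq": 0.95},
    {"id": "snp5", "maf": 0.008, "rsq": 0.10},  # removed: Rsq below 0.3
]

kept, counts = filter_and_bin(example_snps)
print([s["id"] for s in kept])
print(dict(counts))
```

Counting the retained SNPs per frequency class in this way is one route to comparing yield across reference panels, since the same filter and bins can be applied to the output obtained with each panel.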
