Missing data imputation and haplotype phase inference for genome-wide association studies

Imputing missing data and using haplotype-based association tests can improve the power of genome-wide association studies (GWAS). In this article, I review methods for haplotype phase inference and missing data imputation and discuss their application to GWAS. I describe the features shared by the best algorithms for large-scale data sets, note some important differences between classes of methods, and highlight the methods that provide the highest accuracy and fastest computational performance.
