Accounting for Decay of Linkage Disequilibrium in Haplotype Inference and Missing-data Imputation

Although many algorithms exist for estimating haplotypes from genotype data, none takes full account of both the decay of linkage disequilibrium (LD) with distance and the order and spacing of genotyped markers. Here, we describe an algorithm that does take these factors into account, using a flexible model for the decay of LD with distance that can handle both "blocklike" and "nonblocklike" patterns of LD. We compare the accuracy of this approach with that of a range of other available algorithms in three ways: for reconstruction of randomly paired, molecularly determined male X chromosome haplotypes; for reconstruction of haplotypes obtained from trios in an autosomal region; and for estimation of missing genotypes in 50 autosomal genes that have been completely resequenced in 24 African Americans and 23 individuals of European descent. For the autosomal data sets, our new approach clearly outperforms the best available methods, whereas its accuracy in inferring the X chromosome haplotypes is only slightly superior. For estimation of missing genotypes, our method performs slightly better when the two subsamples are combined than when they are analyzed separately, which illustrates its robustness to population stratification. Our method is implemented in the software package PHASE (v2.1.1), available from the Stephens Lab Web site.
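The abstract does not spell out the form of the LD-decay model, so the following is only an illustrative sketch of how marker spacing can enter such a model. It assumes a copying-process style formulation in which each haplotype is treated as a mosaic of template haplotypes, and the chance of switching template between adjacent markers grows with the physical distance between them. The function names, the parameter `rho_per_bp`, and the exponential form are assumptions made for illustration; they are not taken from the text above and should not be read as the PHASE implementation.

```python
import math


def switch_probability(rho_per_bp, distance_bp, n_templates):
    """Probability of a template switch between two adjacent markers.

    Assumed form: 1 - exp(-rho * d / k), so closely spaced markers
    (small d) rarely switch and distant markers switch often.
    """
    return 1.0 - math.exp(-rho_per_bp * distance_bp / n_templates)


def transition_probability(same_template, rho_per_bp, distance_bp, n_templates):
    """Probability of moving to one specific template at the next marker.

    After a switch, the new template is drawn uniformly from all
    n_templates, so a switch can land back on the current template.
    """
    p_switch = switch_probability(rho_per_bp, distance_bp, n_templates)
    jump = p_switch / n_templates
    if same_template:
        return (1.0 - p_switch) + jump
    return jump


if __name__ == "__main__":
    # Hypothetical values: 10 templates, markers 5 kb and 50 kb apart.
    for d in (5_000, 50_000):
        stay = transition_probability(True, 1e-4, d, 10)
        print(f"distance {d} bp: P(stay on same template) = {stay:.4f}")
```

Under these assumptions, widely spaced markers yield a lower probability of staying on the same template, which is one way a model can express weaker LD at greater distances while still using the order and spacing of the markers.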
