Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies.

Recent studies have revealed that linkage disequilibrium (LD) patterns vary across the human genome, with regions of high LD interspersed with regions of low LD. A small fraction of SNPs (tag SNPs) is sufficient to capture most of the haplotype structure of the human genome. In this paper, we develop a method to partition haplotypes into blocks and to identify tag SNPs from genotype data. The method combines a dynamic programming algorithm for haplotype block partitioning and tag SNP selection, which operates on haplotype data, with a variation of the expectation-maximization (EM) algorithm for haplotype inference. Using extensive simulations, we assess how the use of haplotype versus genotype data affects haplotype block identification and tag SNP selection as a function of several factors: sample size, density or number of SNPs studied, allele frequencies, fraction of missing data, and genotyping error rate. We find that a modest number of haplotype or genotype samples results in consistent block partitions and tag SNP selection. The power of association studies based on tag SNPs is similar whether genotype or haplotype data are used.
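
To make the haplotype-inference step concrete, the following is a minimal sketch of a standard EM estimate of haplotype frequencies from unphased genotypes. It is not the variation of EM developed in the paper: it assumes biallelic SNPs, Hardy-Weinberg equilibrium, no missing data, and a small number of SNPs so that all phase resolutions can be enumerated. The function names `compatible_pairs` and `em_haplotype_freqs` and the 0/1/2 genotype coding are illustrative choices, not taken from the paper.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with one unphased genotype.

    genotype: tuple of 0/1/2 giving the number of copies of allele 1 per SNP
    (hypothetical coding chosen for this sketch).
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]  # homozygous sites are fixed
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = base[:], base[:]
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b  # heterozygous site: one copy of each allele
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_freqs(genotypes, n_iter=100, tol=1e-10):
    """Estimate haplotype frequencies by EM, assuming Hardy-Weinberg equilibrium."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pl in pair_lists for pair in pl for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pl in pair_lists:
            # E-step: posterior probability of each phase resolution
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pl]
            total = sum(weights)
            for (a, b), w in zip(pl, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes
        new_freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if delta < tol:
            break
    return freq


if __name__ == "__main__":
    # Two homozygous individuals and one double heterozygote at two SNPs.
    sample = [(2, 2), (0, 0), (1, 1)]
    for hap, f in em_haplotype_freqs(sample).items():
        print(hap, round(f, 4))
```

In this toy sample the double heterozygote is ambiguous on its own, but the two homozygous individuals pull the estimate toward the phase that reuses their haplotypes. Inferred haplotype frequencies of this kind are what a haplotype-based block partitioning step would then consume.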
