A new framework for the selection of tag SNPs by multimarker haplotypes

This paper proposes a new framework for the selection of tag SNPs based on haplotypes instead of on a single SNP. The tag SNPs found by this framework form a set of haplotypes completely predictive of the alleles of all untyped SNPs. We refer to this problem as MTMH, which is defined as follows: given a set of SNPs, find a minimum subset of SNPs (called tag SNPs) which defines a set of haplotypes completely predictive of the alleles of all untyped SNPs. The MTMH problem is solved by dividing into three subproblems, two of which are shown to be NP-hard. Several exact and approximation algorithms are proposed to solve these subproblems. We describe a framework which integrates these algorithms and develop a program called HapTagger for finding tag SNPs. HapTagger is compared with existing methods as well as the official tagging tool (called Haploview) of the International HapMap project using a variety of real data sets. Our theoretical analysis and experimental results indicate that HapTagger consistently identifies a smaller set of tag SNPs and runs much faster than existing methods. HapTagger avoids the need of incorporating a linkage disequilibrium statistic and thus significantly improves the computational efficiency. We also present an algorithm (specific to HapTagger) for reconstructing alleles of untyped SNPs. It is worth mentioning that these predictive haplotypes selected by HapTagger can be used as signatures of recent positive selection or co-evolution. HapTagger is available at http://www.csie.ntu.edu.tw/~kmchao/tools/HapTagger/.

[1]  Ting Chen,et al.  An approximation algorithm for haplotype inference by maximum parsimony. , 2005 .

[2]  Ting Chen,et al.  Haplotype block partition with limited resources and applications to human chromosome 21 haplotype data. , 2003, American journal of human genetics.

[3]  Jing Huang,et al.  Algorithms for large-scale genotyping microarrays , 2003, Bioinform..

[4]  S. Gabriel,et al.  Efficiency and power in genetic association studies , 2005, Nature Genetics.

[5]  Nicholas W Wood,et al.  Selection and evaluation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: implications for linkage-disequilibrium gene mapping. , 2003, American journal of human genetics.

[6]  Joseph Naor,et al.  Approximating Minimum Feedback Sets and Multicuts in Directed Graphs , 1998, Algorithmica.

[7]  Zhaohui S. Qin,et al.  Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. , 2002, American journal of human genetics.

[8]  Geoffrey B. Nilsen,et al.  Whole-Genome Patterns of Common DNA Variation in Three Human Populations , 2005, Science.

[9]  Dorit S. Hochbaum,et al.  Approximation Algorithms for the Set Covering and Vertex Cover Problems , 1982, SIAM J. Comput..

[10]  Aravinda Chakravarti,et al.  Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies , 2004, Nature Genetics.

[11]  Peter Donnelly,et al.  A comparison of bayesian methods for haplotype reconstruction from population genotype data. , 2003, American journal of human genetics.

[12]  M. Olivier A haplotype map of the human genome. , 2003, Nature.

[13]  M. Olivier A haplotype map of the human genome , 2003, Nature.

[14]  Clifford Stein,et al.  Introduction to Algorithms, 2nd edition. , 2001 .

[15]  R. K. Shyamasundar,et al.  Introduction to algorithms , 1996 .

[16]  Kun-Mao Chao,et al.  A greedier approach for finding tag SNPs , 2006, Bioinform..

[17]  S. P. Fodor,et al.  Blocks of Limited Haplotype Diversity Revealed by High-Resolution Scanning of Human Chromosome 21 , 2001, Science.

[18]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[19]  J. Pritchard,et al.  A Map of Recent Positive Selection in the Human Genome , 2006, PLoS biology.

[20]  Ting Chen,et al.  Selecting additional tag SNPs for tolerating missing data in genotyping , 2005, BMC Bioinformatics.

[21]  Dana C Crawford,et al.  Definition and clinical importance of haplotypes. , 2005, Annual review of medicine.

[22]  Christopher A. Haiman,et al.  Choosing Haplotype-Tagging SNPS Based on Unphased Genotype Data Using a Preliminary Sample of Unrelated Subjects with an Example from the Multiethnic Cohort Study , 2003, Human Heredity.

[23]  S. Gabriel,et al.  The Structure of Haplotype Blocks in the Human Genome , 2002, Science.

[24]  C. Carlson,et al.  Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. , 2004, American journal of human genetics.

[25]  Nan Hu,et al.  Genome-wide association study in esophageal cancer using GeneChip mapping 10K array. , 2005, Cancer research.

[26]  Ting Chen,et al.  Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies. , 2004, Genome research.

[27]  L. Helmuth Genome research: map of the human genome 3.0. , 2001, Science.

[28]  Mihalis Yannakakis,et al.  Optimization, approximation, and complexity classes , 1991, STOC '88.

[29]  Pardis C Sabeti,et al.  A high-resolution HLA and SNP haplotype map for disease association studies in the extended human MHC , 2006, Nature Genetics.

[30]  Mark Daly,et al.  Haploview: analysis and visualization of LD and haplotype maps , 2005, Bioinform..

[31]  M. Boehnke,et al.  Experimentally-derived haplotypes substantially increase the efficiency of linkage disequilibrium studies , 2001, Nature Genetics.

[32]  S. P. Fodor,et al.  Large-scale genotyping of complex DNA , 2003, Nature Biotechnology.

[33]  Zhaohui S. Qin,et al.  Bioinformatics Original Paper an Efficient Comprehensive Search Algorithm for Tagsnp Selection Using Linkage Disequilibrium Criteria , 2022 .

[34]  L. Helmuth Map of the Human Genome 3.0 , 2001, Science.

[35]  Russell Schwartz,et al.  Haplotypes and informative SNP selection algorithms: don't block out information , 2003, RECOMB '03.

[36]  Sio Iong Ao,et al.  CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs , 2005, Bioinform..

[37]  M. Daly,et al.  High-resolution haplotype structure in the human genome , 2001, Nature Genetics.