Efficient Genome-Wide TagSNP Selection Across Populations via the Linkage Disequilibrium Criterion

In this article, we studied the tag single-nucleotide polymorphism (tagSNP) selection problem on multiple populations using the pairwise r² linkage disequilibrium criterion. We proposed a novel combinatorial optimization model for tagSNP selection, called the minimum common tagSNP selection (MCTS) problem, and presented efficient solutions for it. Our approach consists of three main steps: (i) partitioning the SNP markers into small disjoint components, (ii) applying data reduction rules to simplify the problem, and (iii) applying either a fast greedy algorithm or a Lagrangian relaxation algorithm to solve the remaining (general) MCTS instance. These algorithms also provide lower bounds on tagging (i.e., on the minimum number of tagSNPs needed), which allow us to evaluate how far our solution is from the optimum. To the best of our knowledge, this is the first time that tagging lower bounds have been discussed in the literature. We assessed the performance of our algorithms on real HapMap data for genome-wide tagging. The experiments demonstrated that our algorithms run 3-4 orders of magnitude faster than existing single-population tagging programs such as FESTA and LD-Select, and than the multiple-population tagging method MultiPop-TagSelect. Our method also required far fewer tagSNPs than LD-Select on a single population and than MultiPop-TagSelect on multiple populations. Moreover, the numbers of tagSNPs selected by our algorithms are almost optimal: they are very close to the corresponding lower bounds obtained by our method.
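To make the greedy option in step (iii) concrete, the following is a minimal Python sketch of a set-cover-style greedy heuristic for common tagging. It is not the authors' implementation, and it omits the partitioning step, the data reduction rules, the Lagrangian relaxation, and the lower bounds. The input layout (a dictionary of per-population pairwise r² values), the function names, the 0.8 threshold, and the toy data are all assumptions made for illustration. A SNP is treated as tagged in a population if it is itself selected or if some selected SNP reaches the r² threshold with it in that population.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical input layout: r2[pop][(i, j)] is the pairwise r^2 between
# SNPs i and j in population `pop`. Pairs that are absent count as r^2 = 0.
R2Table = Dict[str, Dict[Tuple[int, int], float]]
Pair = Tuple[int, str]  # (snp index, population name)


def build_coverage(n_snps: int, r2: R2Table, threshold: float) -> Dict[int, Set[Pair]]:
    """For each SNP, collect the (snp, population) pairs it would tag."""
    # Every SNP tags itself in every population.
    cover: Dict[int, Set[Pair]] = {s: {(s, pop) for pop in r2} for s in range(n_snps)}
    for pop, pairs in r2.items():
        for (i, j), value in pairs.items():
            if value >= threshold:
                cover[i].add((j, pop))
                cover[j].add((i, pop))
    return cover


def greedy_common_tags(n_snps: int, r2: R2Table, threshold: float = 0.8) -> List[int]:
    """Repeatedly pick the SNP that tags the most still-untagged pairs."""
    cover = build_coverage(n_snps, r2, threshold)
    untagged: Set[Pair] = {(s, pop) for s in range(n_snps) for pop in r2}
    tags: List[int] = []
    while untagged:
        # Ties are broken in favour of the smallest SNP index.
        best = max(range(n_snps), key=lambda s: (len(cover[s] & untagged), -s))
        tags.append(best)
        untagged -= cover[best]
    return tags


# Toy data (made up): four SNPs, two populations with different LD patterns.
toy_r2: R2Table = {
    "popA": {(0, 1): 0.90, (1, 2): 0.85, (2, 3): 0.50},
    "popB": {(0, 1): 0.95, (1, 2): 0.30, (2, 3): 0.90},
}
print(greedy_common_tags(4, toy_r2))  # prints [1, 3]
```

In the toy data, SNP 1 tags SNP 0 in both populations and SNP 2 in popA only, so a second tag (SNP 3) is needed to cover SNP 2 in popB. This illustrates why a common tag set must satisfy the r² criterion in every population rather than in a single one.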
