Efficient algorithms for genome-wide tagSNP selection across populations via the linkage disequilibrium criterion.

In this paper, we study the tagSNP selection problem on multiple populations using the pairwise r^2 linkage disequilibrium criterion. We propose a novel combinatorial optimization model for the tagSNP selection problem, called the minimum common tagSNP selection (MCTS) problem, and present efficient solutions for MCTS. Our approach consists of three main steps: (i) partitioning the SNP markers into small disjoint components, (ii) applying data reduction rules to simplify the problem, and (iii) applying either a fast greedy algorithm or a Lagrangian relaxation algorithm to solve the remaining (general) MCTS. These algorithms also provide lower bounds on tagging (i.e., the minimum number of tagSNPs needed), which allow us to evaluate how far our solution is from the optimum. To the best of our knowledge, this is the first time tagging lower bounds have been discussed in the literature. We assess the performance of our algorithms on real HapMap data for genome-wide tagging. The experiments demonstrate that our algorithms run 3 to 4 orders of magnitude faster than existing single-population tagging programs such as FESTA and LD-Select and the multiple-population tagging method MultiPop-TagSelect. Our method also greatly reduces the number of required tagSNPs compared to LD-Select on a single population and MultiPop-TagSelect on multiple populations. Moreover, the numbers of tagSNPs selected by our algorithms are almost optimal, since they are very close to the corresponding lower bounds obtained by our method.
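As an illustration of the greedy step (iii) only, the following is a minimal sketch and not the authors' implementation. It assumes that a SNP is tagged in a population when it is selected or has r^2 at or above a threshold with a selected SNP in that population, and that every SNP must be tagged in every population; the abstract does not spell out these details. The function names, the 0.8 threshold, and the toy r^2 values are all hypothetical. Partitioning, data reduction, and the Lagrangian relaxation are not shown.

```python
def build_cover_sets(r2_by_pop, n_snps, threshold=0.8):
    """Return covers, where covers[s] is the set of (population, snp)
    pairs that SNP s would tag under the assumed r^2 threshold rule."""
    covers = [set() for _ in range(n_snps)]
    for pop, r2 in enumerate(r2_by_pop):
        for s in range(n_snps):
            covers[s].add((pop, s))  # a selected SNP always tags itself
        for (a, b), value in r2.items():
            if value >= threshold:   # r^2 is symmetric, so tag both ways
                covers[a].add((pop, b))
                covers[b].add((pop, a))
    return covers


def greedy_common_tags(r2_by_pop, n_snps, threshold=0.8):
    """Greedy set-cover style selection of one tag set shared by all
    populations (hypothetical helper, assumed problem definition)."""
    covers = build_cover_sets(r2_by_pop, n_snps, threshold)
    uncovered = {(pop, s)
                 for pop in range(len(r2_by_pop))
                 for s in range(n_snps)}
    tags = []
    while uncovered:
        # Pick the SNP that tags the most still-untagged (population, snp)
        # pairs; ties are broken in favour of the lowest SNP index.
        best = max(range(n_snps),
                   key=lambda s: (len(covers[s] & uncovered), -s))
        tags.append(best)
        uncovered -= covers[best]
    return tags


# Toy data (invented for illustration): 4 SNPs, 2 populations.
# Each dict maps an unordered SNP pair (a, b) to its pairwise r^2.
example_r2 = [
    {(0, 1): 0.90, (1, 2): 0.85, (2, 3): 0.30},  # population 0
    {(0, 1): 0.95, (1, 2): 0.40, (2, 3): 0.90},  # population 1
]
example_tags = greedy_common_tags(example_r2, n_snps=4)
print(example_tags)  # [1, 3]
```

On this toy input, SNP 1 is chosen first because it tags five of the eight (population, SNP) pairs, and SNP 3 then tags the remaining three. A greedy choice like this gives only an upper bound on the number of tags needed; judging how close it is to optimal requires a separate lower bound, which this sketch does not compute.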
