Efficient Genome Wide Tagging by Reduction to SAT

Whole-genome association studies have recently achieved remarkable successes in identifying loci involved in disease. Designing these studies involves selecting a subset of known single nucleotide polymorphisms (SNPs), called tag SNPs, to be genotyped. Choosing tag SNPs is an active area of research, and the problem is usually formulated as selecting the fewest tag SNPs that "cover" the remaining SNPs, where "cover" is defined by some statistical criterion. Because the standard formulation of the tag SNP selection problem is NP-hard, most algorithms for selecting tag SNPs are either heuristics, which do not guarantee a minimal set of tag SNPs, or exhaustive algorithms, which are computationally impractical. In this paper, we present a set of methods that are guaranteed to find a minimal set of tag SNPs, yet in practice are much faster than traditional exhaustive algorithms. We demonstrate that our methods can be applied to discover minimal tag sets for the entire human genome. Our methods convert an instance of the tag SNP selection problem into an instance of the satisfiability problem, encoded in conjunctive normal form (CNF). We take advantage of the local structure inherent in human variation, as well as progress in knowledge compilation, and convert our CNF encoding into a representation known as decomposable negation normal form (DNNF), from which solutions to the original problem can be easily enumerated. We demonstrate our methods by constructing the optimal tag set for the whole genome and show that we significantly outperform previous exhaustive search-based methods. We also present optimal solutions for the problem of selecting multi-marker tags, in which some SNPs are "covered" by a pair of tag SNPs. Multi-marker tags can significantly decrease the number of tags we need to select; however, discovering the minimal number of multi-marker tags is much more difficult.
We evaluate our methods and perform benchmark comparisons to other methods by choosing tag sets using the HapMap data.
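
To make the reduction concrete, the following is a minimal sketch of one plausible CNF encoding for single-marker tags; the abstract does not spell out the encoding, so the clause form here is an assumption. Each SNP gets a Boolean variable meaning "this SNP is selected as a tag", and each SNP contributes one clause requiring that at least one of the SNPs that cover it be selected. The `covers` map, the SNP names, and all function names are hypothetical, and the brute-force search over tag-set sizes merely stands in for the DNNF compilation step, which is not reproduced here.

```python
from itertools import combinations

def coverage_clauses(covers):
    """Build one clause per SNP: at least one SNP that covers it must be a tag.

    `covers` (hypothetical input) maps each SNP to the set of SNPs that cover
    it under some precomputed statistical criterion; each SNP covers itself.
    """
    return [sorted(covers[snp]) for snp in sorted(covers)]

def is_model(tags, clauses):
    """Return True if selecting `tags` satisfies every coverage clause."""
    return all(any(var in tags for var in clause) for clause in clauses)

def minimal_tag_sets(covers):
    """Enumerate all minimum-cardinality tag sets by exhaustive search.

    This is the naive baseline, exponential in the number of SNPs; it only
    illustrates what "minimal model of the CNF" means for this problem.
    """
    clauses = coverage_clauses(covers)
    snps = sorted(covers)
    for size in range(len(snps) + 1):
        found = [set(c) for c in combinations(snps, size)
                 if is_model(set(c), clauses)]
        if found:
            return found
    return []

# Toy instance: s2 covers s1, s2 and s3; s4 is covered only by itself.
toy_covers = {
    "s1": {"s1", "s2"},
    "s2": {"s1", "s2", "s3"},
    "s3": {"s2", "s3"},
    "s4": {"s4"},
}
print(minimal_tag_sets(toy_covers))
```

On this toy instance the only minimum tag set is `s2` together with `s4`. The sketch handles only single-marker coverage; multi-marker tags, where a pair of tag SNPs jointly covers a SNP, would need a richer encoding.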