Efficient and Tight Upper Bounds for Haplotype Inference by Pure Parsimony Using Delayed Haplotype Selection

Haplotype inference from genotype data is a key step towards a better understanding of the role played by genetic variations in inherited diseases. One of the most promising approaches uses the pure parsimony criterion. This approach, called Haplotype Inference by Pure Parsimony (HIPP), aims at minimising the number of haplotypes required to explain a given set of genotypes and is NP-hard. The HIPP problem is often solved using constraint satisfaction techniques, for which the upper bound on the number of required haplotypes is a key issue. Another very well-known approach is Clark's method, which resolves genotypes by greedily selecting an explaining pair of haplotypes. In this work, we combine the basic idea of Clark's method with a more sophisticated method for the selection of explaining haplotypes, in order to explicitly introduce a bias towards parsimonious explanations. This new algorithm can be used either to obtain an approximate solution to the HIPP problem or to obtain an upper bound on the size of the pure parsimony solution. This upper bound can then be used to efficiently encode the problem as a constraint satisfaction problem. The experimental evaluation, conducted using a large set of real and artificially generated examples, shows that the new method is much more effective than Clark's method at obtaining parsimonious solutions, while keeping the advantages of simplicity and speed of Clark's method.
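To make the notions of "explaining pair" and "upper bound" concrete, the following is a minimal Python sketch of a Clark-style greedy resolution. It is not the delayed haplotype selection algorithm of this work, whose details are not given in the abstract. The encoding (0 and 1 for homozygous sites, 2 for heterozygous sites) and all function names are assumptions chosen for illustration. Any complete resolution of the genotypes uses at least as many distinct haplotypes as the HIPP optimum, so the size of the returned set is a valid, though not necessarily tight, upper bound.

```python
from typing import List, Optional, Set, Tuple

Genotype = Tuple[int, ...]   # assumed encoding: 0/1 = homozygous, 2 = heterozygous
Haplotype = Tuple[int, ...]  # 0/1 at every site


def compatible(h: Haplotype, g: Genotype) -> bool:
    """True if h can be one member of a pair explaining g."""
    return all(gs == 2 or gs == hs for hs, gs in zip(h, g))


def complement(h: Haplotype, g: Genotype) -> Haplotype:
    """The unique partner that, together with a compatible h, explains g."""
    return tuple(1 - hs if gs == 2 else gs for hs, gs in zip(h, g))


def explains(h1: Haplotype, h2: Haplotype, g: Genotype) -> bool:
    """True if the pair (h1, h2) explains g at every site."""
    return all((a != b) if s == 2 else (a == s and b == s)
               for a, b, s in zip(h1, h2, g))


def greedy_upper_bound(genotypes: List[Genotype]):
    """Clark-style greedy resolution (illustrative sketch only).

    Returns the set of haplotypes used and the chosen pair per genotype.
    """
    known: Set[Haplotype] = set()
    pending = list(genotypes)
    resolution: List[Tuple[Genotype, Haplotype, Haplotype]] = []
    while pending:
        pick: Optional[Tuple[Genotype, Haplotype]] = None
        # Prefer a genotype that an already selected haplotype helps explain.
        for g in pending:
            for h in sorted(known):
                if compatible(h, g):
                    pick = (g, h)
                    break
            if pick is not None:
                break
        if pick is None:
            # No known haplotype applies: seed with an arbitrary choice
            # (heterozygous sites set to 0 in the first haplotype).
            g = pending[0]
            h = tuple(0 if s == 2 else s for s in g)
        else:
            g, h = pick
        partner = complement(h, g)
        known.update((h, partner))
        resolution.append((g, h, partner))
        pending.remove(g)
    return known, resolution


genotypes = [(0, 0, 1), (2, 0, 1), (2, 2, 1)]
haplotypes, pairs = greedy_upper_bound(genotypes)
print(len(haplotypes), "haplotypes; trivial bound is", 2 * len(genotypes))
```

On this toy instance the greedy pass reuses the haplotype (0, 0, 1) for all three genotypes, so the bound it reports is well below the trivial bound of two haplotypes per genotype. In general the result of such a greedy pass depends on the order in which genotypes and haplotypes are tried.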
