The pure parsimony haplotyping problem: overview and computational advances

Haplotype estimation from aligned single-nucleotide polymorphism fragments has attracted increasing attention in recent years because of its importance in the analysis of fine-scale genetic data. Its applications range from mapping complex disease genes to inferring population histories, and include drug design, functional genomics, and pharmacogenetics. The literature proposes a number of estimation criteria for selecting a set of haplotypes among the possible alternatives. Such criteria can usually be expressed as objective functions, and the sets of haplotypes that optimize them are referred to as optimal. One of the most important estimation criteria is pure parsimony, which states that the optimal set of haplotypes for a given set of genotypes is the one with minimal cardinality. Finding the minimal number of haplotypes necessary to explain a given set of genotypes involves solving an optimization problem, called the pure parsimony haplotyping (PPH) estimation problem, which is notoriously NP-hard. This article provides an overview of PPH and discusses the different solution approaches that appear in the literature.
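
To make the pure parsimony criterion concrete, the following is a minimal brute-force sketch, not one of the methods surveyed in the article. It assumes a common encoding that the abstract itself does not specify: each genotype site is 0 or 1 when homozygous and 2 when heterozygous, and a pair of haplotypes explains a genotype if both agree with it at every homozygous site and differ at every heterozygous site. The function names (`explains`, `expansions`, `pure_parsimony`) are hypothetical. The search is exponential in the number of heterozygous sites, in keeping with the NP-hardness of PPH, so it is only suitable for toy instances.

```python
from itertools import combinations, product


def explains(h1, h2, genotype):
    """True if the haplotype pair (h1, h2) resolves the genotype.

    Assumed encoding: 0/1 = homozygous site, 2 = heterozygous site.
    """
    return all(
        (a != b) if s == 2 else (a == b == s)
        for a, b, s in zip(h1, h2, genotype)
    )


def expansions(genotype):
    """Yield every haplotype pair that resolves the genotype."""
    het = [i for i, s in enumerate(genotype) if s == 2]
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        yield tuple(h1), tuple(h2)


def pure_parsimony(genotypes):
    """Return a minimum-cardinality haplotype set explaining all genotypes.

    Exhaustive search over subsets of candidate haplotypes, smallest first.
    """
    candidates = sorted(
        {h for g in genotypes for pair in expansions(g) for h in pair}
    )
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if all(
                any(explains(h1, h2, g) for h1 in subset for h2 in subset)
                for g in genotypes
            ):
                return list(subset)
    return []


if __name__ == "__main__":
    genotypes = [(0, 2), (2, 0), (2, 2)]
    print(pure_parsimony(genotypes))  # prints [(0, 0), (0, 1), (1, 0)]
```

In this toy instance the genotype (2, 2) could also be resolved by the pair (0, 0) and (1, 1), but that would require a fourth haplotype overall; pure parsimony prefers the resolution (0, 1) and (1, 0), which reuses haplotypes already needed for the other two genotypes.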
