Mixed Integer Linear Programming for Maximum-Parsimony Phylogeny Inference

Reconstruction of phylogenetic trees is a fundamental problem in computational biology. While excellent heuristic methods are available for many variants of this problem, new advances in phylogeny inference will be required if we are to continue making effective use of the rapidly growing stores of variation data now being gathered. In this paper, we present two integer linear programming (ILP) formulations for finding the most parsimonious phylogenetic tree from a set of binary variation data. The first uses a flow-based formulation that can produce exponentially many variables and constraints in the worst case. This method has nevertheless proven extremely efficient in practice on datasets well beyond the reach of the available provably efficient methods, solving several large mtDNA and Y-chromosome instances within a few seconds and giving provably optimal results in times competitive with fast heuristics that cannot guarantee optimality. The second formulation establishes that the problem can be solved with a polynomial-sized ILP. We further present a web server, based on the exponential-sized ILP, that performs fast maximum-parsimony inference and serves as a front end to a database of precomputed phylogenies spanning the human genome.
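
To make the objective concrete, the following minimal Python sketch brackets the length of a most parsimonious tree for a small set of binary sequences. It is not the flow-based or polynomial-sized ILP described above. It assumes taxa are given as equal-length strings of 0s and 1s, and the function names (`hamming`, `variable_sites`, `mst_length`, `parsimony_bounds`) are hypothetical, chosen for illustration only. The lower bound counts the variable sites, since each must mutate at least once; the upper bound is the length of a minimum spanning tree over the input taxa under Hamming distance, which is a valid phylogeny that uses no inferred (Steiner) ancestors.

```python
def hamming(a, b):
    """Number of sites at which two equal-length binary strings differ."""
    return sum(x != y for x, y in zip(a, b))


def variable_sites(taxa):
    """Lower bound: every site that varies across taxa needs >= 1 mutation."""
    return sum(1 for column in zip(*taxa) if len(set(column)) > 1)


def mst_length(taxa):
    """Upper bound: minimum spanning tree over the taxa (Prim's algorithm)."""
    n = len(taxa)
    if n == 0:
        return 0
    in_tree = {0}
    # best[i] = cheapest known edge from taxon i to the growing tree
    best = [hamming(taxa[0], t) for t in taxa]
    total = 0
    while len(in_tree) < n:
        j = min((i for i in range(n) if i not in in_tree),
                key=lambda i: best[i])
        total += best[j]
        in_tree.add(j)
        for i in range(n):
            if i not in in_tree:
                best[i] = min(best[i], hamming(taxa[j], taxa[i]))
    return total


def parsimony_bounds(taxa):
    """Return (lower, upper) bounds on the most parsimonious tree length."""
    return variable_sites(taxa), mst_length(taxa)


if __name__ == "__main__":
    print(parsimony_bounds(["000", "110", "101", "011"]))
```

When the two bounds coincide, the spanning tree is itself optimal. When they differ, as in the four-taxon input above, an optimal tree may need inferred ancestral sequences, and finding them is the part that an exact method such as an ILP has to solve.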

[1]  James A. Cuff,et al.  Genome sequence, comparative analysis and haplotype structure of the domestic dog , 2005, Nature.

[2]  Toshihiro Tanaka The International HapMap Project , 2003, Nature.

[3]  P. Buneman The Recovery of Trees from Measures of Dissimilarity , 1971 .

[4]  Jean-Pierre Barthélemy,et al.  From copair hypergraphs to median graphs with latent vertices , 1989, Discret. Math..

[5]  H. Bandelt,et al.  Mitochondrial portraits of human populations using median networks. , 1995, Genetics.

[6]  R. Graham,et al.  The steiner problem in phylogeny is NP-complete , 1982 .

[7]  Dan Gusfield,et al.  Efficient algorithms for inferring evolutionary trees , 1991, Networks.

[8]  David Fernández-Baca,et al.  A Polynomial-Time Algorithm for Near-Perfect Phylogeny , 1996, SIAM J. Comput..

[9]  Dana S. Richards,et al.  Steiner tree problems , 1992, Networks.

[10]  Shibu Yooseph,et al.  Haplotyping as Perfect Phylogeny: A Direct Approach , 2003, J. Comput. Biol..

[11]  Ekta Rai,et al.  Human mtDNA hypervariable regions, HVR I and II, hint at deep common maternal founder and subsequent maternal gene flow in Indian population groups , 2005, Journal of Human Genetics.

[12]  M. Hammer,et al.  High levels of Y-chromosome nucleotide diversity in the genus Pan , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[13]  Richard T. Wong,et al.  A dual ascent approach for steiner tree problems on a directed graph , 1984, Math. Program..

[14]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[15]  Giovanna Morelli,et al.  Distinguishing human ethnic groups by means of sequences from Helicobacter pylori: lessons from Ladakh. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[16]  David Fernández-Baca,et al.  A Polynomial-Time Algorithm for the Perfect Phylogeny Problem when the Number of Character States is Fixed , 1994 .

[17]  Guy E. Blelloch,et al.  Simple Reconstruction of Binary Near-Perfect Phylogenetic Trees , 2006, International Conference on Computational Science.

[18]  John E. Beasley An algorithm for the steiner problem in graphs , 1984, Networks.

[19]  Jean L. Chang,et al.  Initial sequence of the chimpanzee genome and comparison with the human genome , 2005, Nature.

[20]  Russell Schwartz,et al.  Optimal imperfect phylogeny reconstruction and haplotyping (IPPH). , 2006, Computational systems bioinformatics. Computational Systems Bioinformatics Conference.

[21]  Guy E. Blelloch,et al.  Fixed Parameter Tractability of Binary Near-Perfect Phylogenetic Tree Reconstruction , 2006, ICALP.

[22]  Elizabeth M. Smigielski,et al.  dbSNP: a database of single nucleotide polymorphisms , 2000, Nucleic Acids Res..

[23]  Eric S. Lander,et al.  Large-scale discovery and genotyping of single-nucleotide polymorphisms in the mouse , 2000, Nature Genetics.

[24]  Paul T. Groth,et al.  The ENCODE (ENCyclopedia Of DNA Elements) Project , 2004, Science.

[25]  Hans Jürgen Prömel,et al.  The Steiner Tree Problem , 2002 .

[26]  Kári Stefánsson,et al.  mtDNA variation in Inuit populations of Greenland and Canada: migration history and population structure. , 2006, American journal of physical anthropology.

[27]  D. Du,et al.  Steiner Trees in Industry , 2002 .

[28]  David Fernández-Baca,et al.  A Polynomial-Time Algorithm for the Perfect Phylogeny Problem when the Number of Character States is Fixed , 1993, FOCS.

[29]  S. E. Dreyfus,et al.  The steiner problem in graphs , 1971, Networks.

[30]  Guy E. Blelloch,et al.  Efficiently Finding the Most Parsimonious Phylogenetic Tree Via Linear Programming , 2007, ISBRA.

[31]  GusfieldDan Introduction to the IEEE/ACM Transactions on Computational Biology and Bioinformatics , 2004 .

[32]  Cecil M. Lewis,et al.  Land, language, and loci: mtDNA in Native Americans and the genetic history of Peru. , 2005, American journal of physical anthropology.

[33]  Dan Gusfield,et al.  Haplotype Inference by Pure Parsimony , 2003, CPM.

[34]  Sampath Kannan,et al.  A fast algorithm for the computation and enumeration of perfect phylogenies when the number of character states is fixed , 1995, SODA '95.

[35]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[36]  Dan Gusfield,et al.  A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters , 2005, RECOMB.