Inference of parsimonious species phylogenies from multi-locus data

The main focus of this dissertation is the inference of species phylogenies, i.e., the evolutionary histories of species. Species phylogenies allow us to gain insights into the mechanisms of evolution and to hypothesize past evolutionary events. They also find applications in medicine, for example, in understanding antibiotic resistance in bacteria. The reconstruction of species phylogenies is, therefore, of both biological and practical importance. In the traditional method for inferring species trees from genetic data, we sequence a single locus in the species' genomes, reconstruct a gene tree, and report it as the species tree. Biologists have long acknowledged that a gene tree can differ from the species tree, which implies that this traditional method might infer the wrong species tree. Moreover, reticulate events such as horizontal gene transfer and hybridization mean that the evolution of species is no longer tree-like. The availability of multi-locus data provides us with excellent opportunities to resolve these long-standing problems. In this dissertation, we present parsimony-based algorithms for reconciling species/gene tree incongruence that is assumed to be due solely to lineage sorting. We also describe a unified framework for detecting hybridization despite lineage sorting.

To address the first problem, species/gene tree incongruence caused by lineage sorting, we present three algorithms. In Chapter 3, we present an algorithm based on an integer linear programming (ILP) formulation that infers the species tree's topology and divergence times from multiple gene trees. In Chapter 4, we describe two methods that infer the species tree by minimizing deep coalescences (MDC), a criterion introduced by Maddison in 1997. The first method is also based on an ILP formulation, but it eliminates the phase of the Chapter 3 algorithm that enumerates candidate species trees. The second method further eliminates the dependence on external ILP solvers by employing dynamic programming. We ran these methods on both biological and simulated data, and the experimental results demonstrate their high accuracy and speed in species tree inference, which makes them suitable for analyzing multi-locus data.

The second problem this dissertation deals with is the detection of reticulation (e.g., horizontal gene transfer, hybridization) despite lineage sorting. The phylogeny-based approach compares the evolutionary histories of different genomic regions and tests them for incongruence that would indicate hybridization. However, since incongruence between species trees and gene trees can also be due to lineage sorting, phylogeny-based hybridization detection methods might overestimate the amount of hybridization. We present in this dissertation a framework that can handle both hybridization and lineage sorting simultaneously. In this framework, we extend the MDC criterion to phylogenetic networks and use it to propose a heuristic for detecting hybridization despite lineage sorting. Empirical results on a simulated data set and a yeast data set show promising performance and suggest several directions for future research.
