Fast and accurate statistical inference of phylogenetic networks using large-scale genomic sequence data

An emerging discovery in phylogenomics is that interspecific gene flow has played a major role in the evolution of many different organisms. To what extent is the Tree of Life not truly a tree reflecting strict “vertical” divergence, but rather a more general graph structure known as a phylogenetic network, which also captures “horizontal” gene flow? The answer to this fundamental question depends not only upon densely sampled and divergent genomic sequence data, but also upon computational methods that can accurately and efficiently infer phylogenetic networks from large-scale genomic sequence datasets. Recent methodological advances have attempted to address this gap. However, in the 2016 performance study of Hejase and Liu, state-of-the-art methods fell well short of the scalability requirements of existing phylogenomic studies. The methodological gap remains: how can phylogenetic networks be accurately and efficiently inferred using genomic sequence data involving many dozens or hundreds of taxa? In this study, we address this gap by proposing a new phylogenetic divide-and-conquer method which we call FastNet. We conduct a performance study involving a range of evolutionary scenarios, and we demonstrate that FastNet outperforms state-of-the-art methods in terms of computational efficiency and topological accuracy.

[1] Z. Yang, et al. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. 1993, Molecular biology and evolution.

[2] Antonis Rokas, et al. Inferring ancient divergences requires genes with strong phylogenetic signals. 2013, Nature.

[3] Kevin J. Liu, et al. Interspecific introgressive origin of genomic diversity in the house mouse. 2013, Proceedings of the National Academy of Sciences.

[4] Claudia R. Solís-Lemus, et al. Inferring Phylogenetic Networks with Maximum Pseudolikelihood under Incomplete Lineage Sorting. 2015, PLoS genetics.

[5] K. Katoh, et al. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. 2013, Molecular biology and evolution.

[6] J. Huelsenbeck, et al. SUCCESS OF PHYLOGENETIC METHODS IN THE FOUR-TAXON CASE. 1993.

[7] Daniel H. Huson, et al. Phylogenetic Networks: Algorithms and applications. 2011.

[8] Andrew Rambaut, et al. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. 1997, Comput. Appl. Biosci.

[9] Richard R. Hudson, et al. Generating samples under a Wright-Fisher neutral model of genetic variation. 2002, Bioinform.

[10] S. Edwards. IS A NEW AND GENERAL THEORY OF MOLECULAR SYSTEMATICS EMERGING? 2009, Evolution; international journal of organic evolution.

[11] J. Mallet. Hybrid speciation. 2007, Nature.

[12] Kenneth H. Wolfe, et al. Origin of the Yeast Whole-Genome Duplication. 2015, PLoS biology.

[13] B. Birren, et al. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. 2004, Nature.

[14] Michael J. Sanderson, et al. R8s: Inferring Absolute Rates of Molecular Evolution, Divergence times in the Absence of a Molecular Clock. 2003, Bioinform.

[15] Tandy Warnow, et al. Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer. 2015, bioRxiv.

[16] Philip L. F. Johnson, et al. Genetic history of an archaic hominin group from Denisova Cave in Siberia. 2010, Nature.

[17] Tandy J. Warnow, et al. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. 2015, Bioinform.

[18] Tandy Warnow, et al. Evaluating Summary Methods for Multilocus Species Tree Estimation in the Presence of Incomplete Lineage Sorting. 2016, Systematic biology.

[19] A. Dress, et al. A canonical decomposition theory for metrics on a finite set. 1992.

[20] Y. Benjamini, et al. Controlling the false discovery rate: a practical and powerful approach to multiple testing. 1995.

[21] Anders E. Halager, et al. A New Isolation with Migration Model along Complete Genomes Infers Very Different Divergence Processes among Closely Related Great Ape Species. 2012, PLoS genetics.

[22] Simon H. Martin, et al. Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. 2012, Nature.

[23] David Reich, et al. Testing for ancient admixture between closely related populations. 2011, Molecular biology and evolution.

[24] Luay Nakhleh, et al. The Probability of a Gene Tree Topology within a Phylogenetic Network with Applications to Hybridization Detection. 2012, PLoS genetics.

[25] Ying Song, et al. An HMM-Based Comparative Genomic Framework for Detecting Introgression in Eukaryotes. 2013, PLoS Comput. Biol.

[26] S. Jeffery. Evolution of Protein Molecules. 1979.

[27] Adam P. Arkin, et al. FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix. 2009, Molecular biology and evolution.

[28] Yun Yu, et al. A maximum pseudo-likelihood approach for phylogenetic networks. 2015, BMC Genomics.

[29] D. Huson, et al. Dendroscope 3: an interactive tool for rooted phylogenetic trees and networks. 2012, Systematic biology.

[30] J. Oliver, et al. The general stochastic model of nucleotide substitution. 1990, Journal of theoretical biology.

[31] J. Palmer, et al. Horizontal gene transfer in eukaryotic evolution. 2008, Nature Reviews Genetics.

[32] Daniel H. Huson, et al. Phylogenetic Networks: Contents. 2010.

[33] Luay Nakhleh, et al. Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. 2011, Systematic biology.

[34] Tandy J. Warnow, et al. Towards the Development of Computational Tools for Evaluating Phylogenetic Network Reconstruction Methods. 2002, Pacific Symposium on Biocomputing.

[35] J. Felsenstein, et al. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. 1994, Molecular biology and evolution.

[36] J. Slot, et al. Dimensions of Horizontal Gene Transfer in Eukaryotic Microbial Pathogens. 2015, PLoS pathogens.

[37] Paramvir S. Dehal, et al. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. 2010, PloS one.

[38] J. Felsenstein. Cases in which Parsimony or Compatibility Methods will be Positively Misleading. 1978.

[39] V. Moulton, et al. Neighbor-net: an agglomerative method for the construction of phylogenetic networks. 2002, Molecular biology and evolution.

[40] Daniel H. Huson, et al. Phylogenetic Networks - Concepts, Algorithms and Applications. 2011.

[41] Markus S. Schröder, et al. Comparative Genome Analysis and Gene Finding in Candida Species Using CGOB. 2013, Molecular biology and evolution.

[42] Clifford M. Hurvich, et al. Regression and time series model selection in small samples. 1989.

[43] Tandy J. Warnow, et al. ASTRAL: genome-scale coalescent-based species tree estimation. 2014, Bioinform.

[44] H. Akaike. A new look at the statistical model identification. 1974.

[45] Kevin J. Liu, et al. A scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation. 2016, BMC Bioinformatics.

[46] James E. Allen, et al. Highly evolvable malaria vectors: The genomes of 16 Anopheles mosquitoes. 2014, Science.

[47] Ziheng Yang, et al. The influence of gene flow on species tree estimation: a simulation study. 2014, Systematic biology.

[48] Serita M. Nelesen, et al. SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. 2012, Systematic biology.

[49] Toni Gabaldón, et al. Beyond the Whole-Genome Duplication: Phylogenetic Evidence for an Ancient Interspecies Hybridization in the Baker's Yeast Lineage. 2015, PLoS biology.

[50] Jean-Luc Legras, et al. Deciphering the Hybridisation History Leading to the Lager Lineage Based on the Mosaic Genomes of Saccharomyces bayanus Strains NBRC1948 and CBS380T. 2011, PloS one.

[51] Kevin P. Byrne, et al. Analysis of gene evolution and metabolic pathways using the Candida Gene Order Browser. 2010, BMC Genomics.

[52] T. Jukes. CHAPTER 24 – Evolution of Protein Molecules. 1969.

[53] Kevin P. Byrne, et al. The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. 2005, Genome research.

[54] Philip L. F. Johnson, et al. A Draft Sequence of the Neandertal Genome. 2010, Science.

[55] Serita M. Nelesen, et al. Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees. 2009, Science.

[56] Luay Nakhleh, et al. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. 2008, BMC Bioinformatics.

[57] Xiaofang Jiang, et al. Extensive introgression in a malaria vector species complex revealed by phylogenomics. 2015, Science.

[58] Sonja J. Prohaska, et al. Proteinortho: Detection of (Co-)orthologs in large-scale analysis. 2011, BMC Bioinformatics.

[59] Hae Kyung Im, et al. Survey of the Heritability and Sparse Architecture of Gene Expression Traits across Human Tissues. 2016, bioRxiv.

[60] David Bryant, et al. Next-generation sequencing reveals phylogeographic structure and a species tree for recent bird divergences. 2009, Molecular phylogenetics and evolution.

[61] Tandy J. Warnow, et al. PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences. 2015, J. Comput. Biol.

[62] P. Philippsen, et al. The Ashbya gossypii Genome as a Tool for Mapping the Ancient Saccharomyces cerevisiae Genome. 2004, Science.

[63] Luay Nakhleh, et al. Reticulate evolutionary history and extensive introgression in mosquito species revealed by phylogenetic network analysis. 2016, Molecular ecology.

[64] Ramón Doallo, et al. ProtTest 3: fast selection of best-fit models of protein evolution. 2011, Bioinform.

[65] D. Baum. Concordance trees, concordance factors, and the exploration of reticulate genealogy. 2007.

[66] L. Nakhleh, et al. Computational approaches to species phylogeny inference and gene tree reconciliation. 2013, Trends in ecology & evolution.

[67] H. Akaike, et al. Information Theory and an Extension of the Maximum Likelihood Principle. 1973.

[68] G. Schwarz. Estimating the Dimension of a Model. 1978.

[69] M. Metzker. Sequencing technologies — the next generation. 2010, Nature Reviews Genetics.

[70] E. Boyd, et al. Genomic islands are dynamic, ancient integrative elements in bacterial evolution. 2009, Trends in microbiology.

[71] C. J-F, et al. THE COALESCENT. 1980.

[72] Kevin J. Liu, et al. Maximum likelihood inference of reticulate evolutionary histories. 2014, Proceedings of the National Academy of Sciences.

[73] J. McInerney, et al. The prokaryotic tree of life: past, present... and future? 2008, Trends in ecology & evolution.

[74] Carsten Wiuf, et al. Gene Genealogies, Variation and Evolution - A Primer in Coalescent Theory. 2004.