SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees

We present SimPhy, a fast and flexible software package for simulating multiple gene families evolving under incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer (all three of which can lead to species tree/gene tree discordance), as well as gene conversion. SimPhy implements a hierarchical phylogenetic model in which the evolution of species, locus, and gene trees is governed by global and local parameters (e.g., genome-wide, species-specific, locus-specific) that can either be fixed or be sampled from a priori statistical distributions. SimPhy also incorporates comprehensive models of substitution rate variation among lineages (uncorrelated relaxed clocks) and can simulate partitioned nucleotide, codon, and protein multilocus sequence alignments under a wide range of substitution models using the program INDELible. We validate SimPhy's output against theoretical expectations and other programs, and show that it scales extremely well with complex models and/or large trees, being an order of magnitude faster than the most similar program (DLCoal-Sim). In addition, we demonstrate how SimPhy can help to understand interactions among different evolutionary processes by conducting a simulation study that characterizes the systematic overestimation of the duplication time when using standard reconciliation methods. SimPhy is available at https://github.com/adamallo/SimPhy, where users can find the source code, precompiled executables, a detailed manual, and example cases.
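The idea of hierarchical parameters that are either fixed or sampled can be illustrated with a minimal Python sketch. It is not SimPhy's code or command-line interface: the function names (`draw`, `sample_replicate`), the tuple-based distribution specification, the choice of a duplication rate as the example parameter, and the mean-one gamma multiplier for locus-specific variation are all assumptions made for illustration only.

```python
import random


def draw(spec, rng):
    """Return a fixed value, or sample one if spec is a (name, *params) tuple.

    Hypothetical helper: the distribution names below are illustrative and
    are not SimPhy option names.
    """
    if isinstance(spec, (int, float)):
        return float(spec)  # fixed parameter
    name, *params = spec
    if name == "uniform":
        low, high = params
        return rng.uniform(low, high)
    if name == "lognormal":
        mu, sigma = params
        return rng.lognormvariate(mu, sigma)
    if name == "gamma_mean1":
        (shape,) = params
        # shape * scale = 1, so the multiplier has expectation 1
        return rng.gammavariate(shape, 1.0 / shape)
    raise ValueError(f"unknown distribution: {name}")


def sample_replicate(n_loci, global_rate_spec, locus_shape, rng):
    """Sample one genome-wide rate, then one rate per locus around it."""
    genome_rate = draw(global_rate_spec, rng)  # global level
    locus_rates = [
        genome_rate * draw(("gamma_mean1", locus_shape), rng)  # local level
        for _ in range(n_loci)
    ]
    return genome_rate, locus_rates


if __name__ == "__main__":
    rng = random.Random(42)
    # Genome-wide rate sampled from a prior; locus rates vary around it.
    genome_rate, locus_rates = sample_replicate(
        n_loci=5,
        global_rate_spec=("uniform", 0.001, 0.01),
        locus_shape=2.0,
        rng=rng,
    )
    print(genome_rate, locus_rates)
```

Passing a plain number as `global_rate_spec` keeps the global parameter fixed across replicates, while passing a tuple samples it anew for each replicate; the same pattern could be extended with a species-specific level.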
