ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy

Species tree inference via summary methods that combine gene trees has become increasingly common in recent phylogenomic studies. This broad adoption is partly due to the greater availability of genome-wide data and the wide recognition that gene trees and species trees can differ due to biological processes such as gene duplication and gene loss. It has also been encouraged by the recent development of accurate and scalable summary methods, such as ASTRAL. However, most of these methods, including ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. In this paper, we introduce a measure of quartet similarity between single-copy and multi-copy trees (accounting for orthology and paralogy relationships) that can be optimized via a scalable dynamic programming algorithm similar to the one used by ASTRAL. We then present a new quartet-based species tree inference method: ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs). By studying its performance on an extensive collection of simulated datasets and on a real plant dataset, we show that ASTRAL-Pro is more accurate than alternative methods when gene trees differ from the species tree due to the simultaneous presence of gene duplication, gene loss, incomplete lineage sorting, and estimation errors.
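To make the notion of quartet similarity concrete, the following is a minimal brute-force sketch of the single-copy quartet score that ASTRAL-style methods seek to maximize: the number of four-taxon subtrees on which a gene tree and a candidate species tree agree. It is an illustration under stated assumptions, not the ASTRAL-Pro measure itself: the abstract does not spell out how orthology and paralogy are accounted for, so the sketch assumes every species label occurs at most once per gene tree. The nested-tuple tree encoding and the function names (`clades`, `quartet_topologies`, `quartet_score`) are hypothetical choices made for this example.

```python
from itertools import combinations

# Assumption: a tree is a nested tuple of species names, e.g. (("A", "B"), "C"),
# and each species appears at most once per tree (single-copy case only).

def clades(tree):
    """Return (leaf set of `tree`, list of leaf sets of its internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def quartet_topologies(tree):
    """Map each resolved four-species set to its split, e.g. {A,B} | {C,D}."""
    leaves, found = clades(tree)
    result = {}
    for four in combinations(sorted(leaves), 4):
        four_set = frozenset(four)
        for clade in found:
            inside = clade & four_set
            # A clade holding exactly two of the four species separates
            # them from the other two in the unrooted tree.
            if len(inside) == 2:
                result[four_set] = frozenset([inside, four_set - inside])
                break
    return result

def quartet_score(gene_trees, species_tree):
    """Count quartets on which gene trees agree with the species tree."""
    species_q = quartet_topologies(species_tree)
    score = 0
    for gene_tree in gene_trees:
        for four, topology in quartet_topologies(gene_tree).items():
            if species_q.get(four) == topology:
                score += 1
    return score

species = ((("A", "B"), "C"), ("D", "E"))
gene_1 = ((("A", "B"), "C"), ("D", "E"))   # identical to the species tree
gene_2 = ((("A", "C"), "B"), ("D", "E"))   # A and C swapped into a cherry
print(quartet_score([gene_1, gene_2], species))
```

Enumerating all four-species subsets is only practical for small trees; the dynamic programming mentioned in the abstract is what makes the optimization scale, and it is not reproduced here.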
