Taxon sampling unequally affects individual nodes in a phylogenetic tree: consequences for model gene tree construction in SwissTree

Medium to large phylogenetic gene trees constructed from datasets of different species density and taxonomic range are rarely topologically consistent because of missing phylogenetic signal, nonphylogenetic signal, and error. In this study, we first use simulations to show that taxon sampling affects the nodes of a gene tree unequally, which likely contributes to the controversial conclusions drawn from taxon sampling experiments and to conflicting species phylogenies, such as those for the boreoeutherians. Because it is therefore unlikely that a large gene tree can be reconstructed correctly from a single optimized dataset, we take a two-step approach to the construction of model gene trees. First, stable and unstable clades are identified by comparing phylogenetic trees inferred from multiple datasets and data types (nucleotide, amino acid, codon) of the same gene family. Subsequently, data subsets are optimized for the analysis of individual uncertain clades. Results are summarized in the form of a model tree that illustrates the evolutionary relationships of gene loci. A case study shows how a seemingly complex gene phylogeny becomes increasingly consistent with the reference species tree through attentive taxon sampling and subtree analysis. The procedure is being progressively introduced into SwissTree (http://swisstree.vital-it.ch), a resource of high-confidence model gene (locus) trees. Finally, we demonstrate the usefulness of SwissTree for orthology benchmarking.
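
The first step of the two-step approach, separating stable from unstable clades by comparing trees inferred from several datasets, can be illustrated with a minimal sketch. It is not the procedure used for SwissTree; it only shows the idea under simplifying assumptions: all trees are rooted, share the same set of leaves, and are written as nested Python tuples. The function names (`clades`, `clade_support`, `split_stable_unstable`), the `threshold` parameter, and the toy taxa are hypothetical and chosen for illustration only.

```python
from collections import Counter


def clades(tree):
    """Return all clades of a rooted tree as frozensets of leaf names.

    A tree is a nested tuple; a leaf is a plain string (assumed encoding).
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)  # record every internal node as one clade
        return members

    leaves(tree)
    return found


def clade_support(trees):
    """Fraction of the input trees in which each observed clade occurs."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree))
    return {clade: count / len(trees) for clade, count in counts.items()}


def split_stable_unstable(trees, threshold=1.0):
    """Clades found in at least `threshold` of the trees count as stable."""
    support = clade_support(trees)
    stable = {clade for clade, s in support.items() if s >= threshold}
    unstable = set(support) - stable
    return stable, unstable


# Toy trees standing in for nucleotide, amino acid and codon analyses
# of one gene family (invented topologies, not real results).
tree_nt = ((("human", "mouse"), "dog"), "opossum")
tree_aa = ((("human", "dog"), "mouse"), "opossum")
tree_codon = ((("human", "mouse"), "dog"), "opossum")

stable, unstable = split_stable_unstable([tree_nt, tree_aa, tree_codon])
for clade in sorted(unstable, key=sorted):
    print("unstable clade:", sorted(clade))
```

In this toy input, the clade grouping human, mouse and dog is recovered by every tree, whereas the two competing sister-group arrangements inside it are not. Such conflicting clades are the ones for which an optimized data subset would then be analysed separately in the second step.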
