Disentangling biological and analytical factors that give rise to outlier genes in phylogenomic matrices

The genomic data revolution has enabled biologists to develop innovative ways to infer key episodes in the history of life. Whether genome-scale data will eventually resolve all branches of the Tree of Life remains uncertain. However, novel means of interrogating data are beginning to explain why some evolutionary relationships remain recalcitrant. Here, we describe four biological and analytical factors that explain why certain genes may exhibit “outlier” behavior: rate of molecular evolution, alignment length, misidentified orthology, and errors in modeling. Using empirical and simulated data, we show how excluding genes based on their likelihood, or inferring processes from the topology they support in a supermatrix, can mislead biological inference of conflict. We next show that alignment length accounts for the high influence of two genes reported in empirical datasets. Finally, we reiterate the impact that misidentified orthology and short alignments have on likelihoods in large-scale phylogenetics. We suggest that researchers systematically investigate and describe the source of influential genes rather than discarding them as outliers. Disentangling whether analytical or biological factors are the source of outliers will help uncover new patterns and processes that are shaping the Tree of Life.
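To make the alignment-length point concrete, the following is a minimal sketch, not the procedure used in this study. It assumes each gene has already been scored under two competing topologies, giving one log-likelihood per topology plus the gene's alignment length. The function names, the median-absolute-deviation rule, the threshold of 5, and the toy numbers are all illustrative assumptions. The sketch compares outlier calls made on the total log-likelihood difference with calls made after dividing by alignment length.

```python
from statistics import median


def flag_outliers(values, threshold=5.0):
    """Flag values lying more than threshold * MAD from the median.

    The MAD rule and the default threshold are illustrative choices,
    not taken from the paper.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [abs(v - med) / mad > threshold for v in values]


def summarize(genes):
    """Compare outlier calls on total vs. per-site delta lnL.

    genes: list of (name, lnL_topology_A, lnL_topology_B, alignment_length).
    """
    totals = [a - b for _, a, b, _ in genes]
    per_site = [(a - b) / n for _, a, b, n in genes]
    total_flags = flag_outliers(totals)
    per_site_flags = flag_outliers(per_site)
    return {
        gene[0]: {
            "delta_lnL": totals[i],
            "delta_lnL_per_site": per_site[i],
            "outlier_total": total_flags[i],
            "outlier_per_site": per_site_flags[i],
        }
        for i, gene in enumerate(genes)
    }


# Toy, made-up numbers (hypothetical genes) chosen only to show the effect.
genes = [
    ("gene1", -10000.0, -10005.0, 500),
    ("gene2", -12000.0, -12009.0, 600),
    ("gene3", -11000.0, -11004.0, 500),
    ("gene4", -9000.0, -9006.0, 400),
    ("gene5", -10500.0, -10507.0, 500),
    ("long_gene", -400000.0, -400240.0, 20000),
]

report = summarize(genes)
for name, row in report.items():
    print(name, row)
```

In this toy input the long gene is flagged on its total log-likelihood difference but not on its per-site difference, which is the pattern one would expect when length, rather than unusually strong per-site signal, drives a gene's influence. A gene that stays extreme after normalization would need a different explanation, such as evolutionary rate, misidentified orthology, or model error.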
