From pairs of most similar sequences to phylogenetic best matches

Background
Many commonly used methods for orthology detection start from mutually most similar pairs of genes (reciprocal best hits) as an approximation for evolutionarily most closely related pairs of genes (reciprocal best matches). This approximation of best matches by best hits becomes exact for ultrametric dissimilarities, i.e., under the Molecular Clock Hypothesis. It fails, however, whenever there are large lineage-specific rate variations among paralogous genes. In practice, this introduces a high level of noise into the input data for best-hit-based orthology detection methods.

Results
If additive distances between genes are known, then evolutionarily most closely related pairs can be identified by considering certain quartets of genes, provided that in each quartet the outgroup relative to the remaining three genes is known. A priori knowledge of the underlying species phylogeny greatly facilitates the identification of the required outgroup. Although the workflow remains a heuristic, since the correct outgroup cannot be determined reliably in all cases, simulations with lineage-specific biases and rate asymmetries show that nearly perfect results can be achieved. In a realistic setting, where distance data have to be estimated from sequence data and hence are noisy, it is still possible to obtain highly accurate sets of best matches.

Conclusion
Improvements of tree-free orthology assessment methods can be expected from a combination of the accurate inference of best matches reported here and recent mathematical advances in the understanding of (reciprocal) best match graphs and orthology relations.

Availability
Accompanying software is available at https://github.com/david-schaller/AsymmeTree.
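
The quartet idea summarized under Results can be illustrated with a minimal sketch. It is not the AsymmeTree implementation; the function names, the distance table, and the tolerance parameter are hypothetical and chosen only for illustration. The sketch assumes exactly additive distances and a known outgroup z, and uses the standard four-point condition: among the three pairwise sums of a quartet, the strictly smallest one identifies the split of the underlying tree. If x groups with y1 against y2 and z, then y1 is more closely related to x than y2 is.

```python
def make_distance(table):
    """Wrap a dict of pairwise distances as a symmetric lookup d(a, b)."""
    def d(a, b):
        if a == b:
            return 0.0
        return table[(a, b)] if (a, b) in table else table[(b, a)]
    return d


def best_hits(d, x, candidates):
    """Candidates with minimal distance to x (the best-hit approximation)."""
    smallest = min(d(x, y) for y in candidates)
    return {y for y in candidates if d(x, y) == smallest}


def quartet_best_matches(d, x, y1, y2, z, tol=1e-9):
    """Among y1 and y2, return those most closely related to x.

    Assumes d is additive and z is an outgroup for x, y1, y2.
    """
    s_y1 = d(x, y1) + d(y2, z)    # smallest if the split is x,y1 | y2,z
    s_y2 = d(x, y2) + d(y1, z)    # smallest if the split is x,y2 | y1,z
    s_out = d(x, z) + d(y1, y2)   # smallest if the split is y1,y2 | x,z
    if s_y1 < min(s_y2, s_out) - tol:
        return {y1}
    if s_y2 < min(s_y1, s_out) - tol:
        return {y2}
    # Split y1,y2 | x,z or an unresolved quartet: both are equally related to x.
    return {y1, y2}


# Hypothetical additive distances from a tree in which x and y1 are sisters,
# but y1 sits on a long (fast-evolving) branch: x:1, y1:5, internal edge:1,
# y2:1, z:2.
table = {
    ("x", "y1"): 6.0, ("x", "y2"): 3.0, ("x", "z"): 4.0,
    ("y1", "y2"): 7.0, ("y1", "z"): 8.0, ("y2", "z"): 3.0,
}
d = make_distance(table)

hits = best_hits(d, "x", ["y1", "y2"])
matches = quartet_best_matches(d, "x", "y1", "y2", "z")
```

In this constructed example the best hit of x is y2, because y1 evolves much faster, whereas the quartet test recovers y1, the true sister of x. With distances estimated from sequences the three sums are only approximately related in this way, so a real workflow would need a noise-tolerant decision rule rather than the fixed tolerance used here.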
