Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences

Abstract

Word-based or ‘alignment-free’ sequence comparison has become an active research area in bioinformatics. While earlier word-frequency approaches calculated only rough measures of sequence similarity or dissimilarity, some newer alignment-free methods can accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Here, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can calculate high-quality phylogenetic trees for dozens of whole-proteome sequences in a matter of seconds or minutes, and it often outperforms other alignment-free approaches. The source code of our software is available on GitHub: https://github.com/jschellh/ProtSpaM.
