Fast and accurate phylogeny reconstruction using filtered spaced-word matches

Motivation: Word-based or 'alignment-free' algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods.

Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don't-care positions, FSWM rapidly identifies spaced-word matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don't-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don't-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure in which we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes.

Availability and Implementation: The program source code for FSWM, including documentation, as well as the software that we used to generate artificial genome sequences, are freely available at http://fswm.gobics.de/

Contact: chris.leimeister@stud.uni-goettingen.de

Supplementary information: Supplementary data are available at Bioinformatics online.
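The three steps described under Results (finding spaced-word matches, filtering them by similarity, and estimating substitutions from the don't-care positions) can be illustrated with a small Python sketch. This is not the FSWM implementation. The function names, the +1/-1 scoring of don't-care positions, the default threshold of 0 and the use of the Jukes-Cantor correction are all assumptions made for illustration, and the sketch handles only one strand and short sequences.

```python
import math
from collections import defaultdict


def spaced_word_matches(seq1, seq2, pattern):
    """Yield (i, j) such that seq1[i:] and seq2[j:] agree at every match
    position ('1') of the binary pattern. Hypothetical helper."""
    match_pos = [k for k, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    # Index every spaced word of seq1 by its characters at the match positions.
    index = defaultdict(list)
    for i in range(len(seq1) - span + 1):
        index["".join(seq1[i + k] for k in match_pos)].append(i)
    # Look up every spaced word of seq2 in that index.
    for j in range(len(seq2) - span + 1):
        key = "".join(seq2[j + k] for k in match_pos)
        for i in index.get(key, ()):
            yield i, j


def estimate_distance(seq1, seq2, pattern, threshold=0):
    """Estimate substitutions per site from the don't-care positions ('0')
    of all spaced-word matches that pass the filter.

    Assumptions (not taken from the abstract): similarity is scored as
    +1 per match and -1 per mismatch over the don't-care positions, and
    the mismatch frequency is converted with the Jukes-Cantor formula.
    """
    dont_care = [k for k, c in enumerate(pattern) if c == "0"]
    mismatches = 0
    compared = 0
    for i, j in spaced_word_matches(seq1, seq2, pattern):
        mism = sum(seq1[i + k] != seq2[j + k] for k in dont_care)
        score = (len(dont_care) - mism) - mism
        if score < threshold:
            continue  # discard matches whose similarity is below the threshold
        mismatches += mism
        compared += len(dont_care)
    if compared == 0:
        return None  # no spaced-word match survived the filter
    p = mismatches / compared
    if p >= 0.75:
        return math.inf  # Jukes-Cantor correction is undefined here
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Toy input: pattern "1101" has one don't-care position.
    print(estimate_distance("ACGTTGCA", "ACATTGCA", "1101", threshold=-1))
```

With such a short pattern the filter has almost nothing to work with; in practice a pattern would need many don't-care positions so that the score can separate homologous matches from random background matches.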
