Estimating evolutionary distances between genomic sequences from spaced-word matches

Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis for phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator d_N of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of ‘match positions’ and ‘don’t care positions’. Our software is available online and as downloadable source code at: http://spaced.gobics.de/.
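
To make the quantity N concrete, the following minimal sketch counts spaced-word matches between two DNA sequences for a single pattern. It is an illustration only, not the authors' software: the function names, the encoding of a pattern as a string of `1` (match position) and `0` (don't care position), and the example sequences are all assumptions made here. The normalization of N and the estimator d_N itself are not shown, since the abstract does not state their formulas.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """Yield the spaced word starting at each position of seq.

    pattern is a string of '1' (match position) and '0' (don't care
    position); only characters at match positions are kept.
    This encoding is an assumption made for illustration.
    """
    match_positions = [k for k, c in enumerate(pattern) if c == "1"]
    for i in range(len(seq) - len(pattern) + 1):
        yield "".join(seq[i + k] for k in match_positions)


def count_spaced_word_matches(seq1, seq2, pattern):
    """Count pairs of positions (i, j) whose spaced words are identical."""
    counts1 = Counter(spaced_words(seq1, pattern))
    counts2 = Counter(spaced_words(seq2, pattern))
    # Each word occurring n1 times in seq1 and n2 times in seq2
    # contributes n1 * n2 matching position pairs.
    return sum(n1 * counts2[word] for word, n1 in counts1.items())


if __name__ == "__main__":
    # Hypothetical toy sequences and patterns.
    print(count_spaced_word_matches("ACGTACGT", "ACCTACGA", "1101"))
    print(count_spaced_word_matches("ACGTACGT", "ACCTACGA", "111"))
```

A pattern consisting only of `1` characters reduces to contiguous word matches, so the same function covers both cases compared in the abstract. For the multiple-pattern variant, one would call it once per pattern and combine the counts.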
