Estimating Evolutionary Distances from Spaced-Word Matches

Alignment-free methods are increasingly used to estimate distances between DNA and protein sequences and to reconstruct phylogenetic trees. Most distance functions used by these methods, however, are heuristic measures of dissimilarity, not based on any explicit model of evolution. Herein, we propose a simple estimator of the evolutionary distance between two DNA sequences calculated from the number of (spaced) word matches between them. We show that this distance function estimates the evolutionary distance between DNA sequences more accurately than other distance measures used by alignment-free methods. In addition, we calculate the variance of the number of (spaced) word matches depending on sequence length and mismatch probability.
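To make the idea concrete, the following is a minimal sketch of how a distance could be derived from a spaced-word match count. It is an illustration under stated assumptions, not the estimator defined in the paper. The assumptions are: the two sequences are related without insertions or deletions; homologous positions match independently with probability p; non-homologous positions match with a background probability q (0.25 for uniform nucleotide frequencies); and p is converted to substitutions per site with the Jukes-Cantor correction. The function names and the binary pattern notation (`1` for a match position, `0` for a don't-care position) are hypothetical.

```python
import math
from collections import Counter


def spaced_words(seq, pattern):
    """Yield, for each start position, the characters at the '1' positions of pattern."""
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    for start in range(len(seq) - span + 1):
        yield "".join(seq[start + i] for i in match_pos)


def count_spaced_word_matches(s1, s2, pattern):
    """Count position pairs (i, j) whose spaced words agree at all match positions."""
    c1 = Counter(spaced_words(s1, pattern))
    c2 = Counter(spaced_words(s2, pattern))
    return sum(n * c2[w] for w, n in c1.items())


def estimate_distance(n_matches, len1, len2, pattern, q=0.25):
    """Assumed model: E[N] = h * p**k + (n1 * n2 - h) * q**k, where k is the
    number of match positions, n1 and n2 are the numbers of word start
    positions, and h = min(n1, n2) is the number of homologous position pairs.
    Solve for p, then apply the Jukes-Cantor correction.
    Returns None if the count falls outside the range the model can explain."""
    k = pattern.count("1")
    span = len(pattern)
    n1 = len1 - span + 1
    n2 = len2 - span + 1
    homologous = min(n1, n2)
    background = (n1 * n2 - homologous) * q ** k
    p_k = (n_matches - background) / homologous
    if p_k <= 0:
        return None  # no more matches than expected by chance
    p = min(p_k ** (1.0 / k), 1.0)
    mismatch = 1.0 - p
    if mismatch >= 0.75:
        return None  # Jukes-Cantor distance is undefined (saturation)
    return -0.75 * math.log(1.0 - 4.0 / 3.0 * mismatch)


if __name__ == "__main__":
    a = "ACGTACGTTGCAACGTAGCTAGCTAGGATCCA"
    b = "ACGTACGATGCAACGTAGCTTGCTAGGATCCA"
    pattern = "1101"  # hypothetical pattern: three match positions, one don't-care
    n = count_spaced_word_matches(a, b, pattern)
    print(n, estimate_distance(n, len(a), len(b), pattern))
```

For short sequences such as those in the usage example, the background term is large relative to the homologous signal, so the estimate is noisy; the sketch is meant only to show how a match count can be turned into a model-based distance.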
