Paper information - Estimating evolutionary distances between genomic sequences from spaced-word matches

Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis for phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Here, we propose a simple estimator d_N of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of ‘match positions’ and ‘don’t care positions’. Our software is available online and as downloadable source code at: http://spaced.gobics.de/.
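To make the idea of a distance derived from the number N of spaced-word matches more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state the formula for the estimator, so everything beyond counting matches is an assumption made for illustration: the function names (`spaced_words`, `count_matches`, `estimate_distance`) are hypothetical, the sequences are assumed to be related without insertions or deletions, background matches are assumed to occur with probability q = 1/4 per position, and the match probability is converted into a distance with the Jukes-Cantor correction.

```python
import math
from collections import Counter


def spaced_words(seq, pattern):
    """Yield the spaced word at every start position of seq.

    pattern is a string of '1' (match position) and '0' (don't care position);
    only the characters at '1' positions are kept.
    """
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    for start in range(len(seq) - span + 1):
        yield "".join(seq[start + i] for i in match_pos)


def count_matches(seq_a, seq_b, pattern):
    """Count N, the number of position pairs (i, j) whose spaced words agree."""
    counts_a = Counter(spaced_words(seq_a, pattern))
    counts_b = Counter(spaced_words(seq_b, pattern))
    return sum(n * counts_b[word] for word, n in counts_a.items())


def estimate_distance(n_matches, len_a, len_b, pattern, q=0.25):
    """Illustrative distance estimate from N (assumed model, see lead-in).

    Assumed expectation: N ~ H * p**k + B * q**k, where k is the number of
    match positions, H the number of homologous position pairs, B the number
    of background pairs, and p the per-site match probability.
    """
    k = pattern.count("1")
    span = len(pattern)
    positions_a = len_a - span + 1
    positions_b = len_b - span + 1
    homologous = min(positions_a, positions_b)
    background = positions_a * positions_b - homologous

    # Solve the assumed expectation for p**k.
    p_k = (n_matches - background * q ** k) / homologous
    if p_k <= 0:
        return math.inf  # no signal above the background
    p = min(p_k ** (1.0 / k), 1.0)

    # Jukes-Cantor correction for the mismatch frequency 1 - p.
    mismatch = 1.0 - p
    if mismatch >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 / 3.0 * mismatch)


if __name__ == "__main__":
    a = "ACGTACGTTGCAAGCT"
    b = "ACGTACCTTGCAAGCT"
    pattern = "1101"
    n = count_matches(a, b, pattern)
    print(n, estimate_distance(n, len(a), len(b), pattern))
```

With a contiguous pattern such as "111" the same functions count ordinary k-mer matches, so the sketch can be used to compare spaced and contiguous words on the same pair of sequences. For sequences this short the background term dominates and the estimate is noisy; the assumed model is only meaningful for long sequences.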
