Distance measures for biological sequences: Some recent approaches

Sequence comparison has become an essential tool in modern molecular biology, because in biomolecular sequences high similarity usually implies significant functional or structural similarity. Traditional approaches rely on sequence alignment, which measures character-level differences. However, recent developments in whole-genome sequencing technology have created a need for similarity measures that can capture rearrangements involving large segments of the sequences. This paper illustrates several recently introduced methods for the alignment-free comparison of biological sequences. Its goal is both to highlight the peculiarities of each approach, focusing on its advantages and disadvantages, and to identify the features common to all of these methods.
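To make the contrast with character-level alignment concrete, the following is a minimal sketch of one generic alignment-free measure: a Euclidean distance between k-mer (word) count vectors. It is an illustration only and is not claimed to be any specific method surveyed in the paper; the function names `kmer_counts` and `kmer_distance` and the default `k=3` are assumptions chosen for the example.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer count vectors of a and b.

    No alignment is computed: only word frequencies are compared.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Counter returns 0 for words that are absent from one sequence.
    return sqrt(sum((ca[w] - cb[w]) ** 2 for w in set(ca) | set(cb)))


if __name__ == "__main__":
    block_a = "ACGTACGTTTGACCA"
    block_b = "GGGATCCCATTAGCA"
    original = block_a + block_b
    rearranged = block_b + block_a  # the two large blocks are swapped
    print(kmer_distance(original, original))
    print(kmer_distance(original, rearranged))
```

Swapping two large blocks changes only the k-mers that span the junction between them, so the distance between `original` and `rearranged` stays small, whereas a character-by-character comparison of the same two strings would report many mismatches.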
