A Lossy Compression Technique Enabling Duplication-Aware Sequence Alignment

In spite of the recognized importance of tandem duplications in genome evolution, commonly adopted sequence comparison algorithms do not account for complex mutation events involving more than one residue at a time, because such events violate the underlying assumption that adjacent residues are statistically independent. As a consequence, the presence of tandem repeats in the sequences under comparison may impair the biological significance of the resulting alignment. Although solutions have been proposed, repeat-aware sequence alignment is still considered an open problem, and new efficient and effective methods have been called for. This paper describes an alternative lossy compression scheme for genomic sequences that iteratively collapses repeats of increasing length. The resulting approximate representations contain no tandem duplications, yet retain enough information to make their comparison even more significant than the edit distance between the original sequences. This allows traditional alignment algorithms to be applied directly to the compressed sequences. Results confirm the validity of the proposed approach for duplication-aware sequence alignment.
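
The idea of iteratively collapsing repeats of increasing length, and then comparing the compressed sequences with a traditional measure, can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details the abstract does not give. It assumes exact (not approximate) tandem repeats, collapses each run to a single copy of its unit, and uses a plain Levenshtein distance as the "traditional" comparison. The names `collapse_period`, `compress`, `edit_distance`, and the `max_period` parameter are hypothetical.

```python
def collapse_period(seq: str, k: int) -> str:
    """Collapse every run of exact tandem copies of a length-k unit to one copy."""
    out = []
    i = 0
    n = len(seq)
    while i < n:
        unit = seq[i:i + k]
        j = i + k
        # Extend j over all consecutive copies of the unit starting at i.
        while len(unit) == k and seq[j:j + k] == unit:
            j += k
        if j > i + k:
            # At least two copies found: keep one, skip the rest (lossy step).
            out.append(unit)
            i = j
        else:
            # No repeat starts here: copy one residue and move on.
            out.append(seq[i])
            i += 1
    return "".join(out)


def compress(seq: str, max_period: int) -> str:
    """Collapse repeats of increasing unit length 1..max_period (assumed bound)."""
    for k in range(1, max_period + 1):
        # Repeat each pass until stable, since collapsing may expose new repeats.
        while True:
            collapsed = collapse_period(seq, k)
            if collapsed == seq:
                break
            seq = collapsed
    return seq


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance with unit costs (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    x, y = "ACACACGT", "ACGTGT"  # differ only by tandem duplications
    cx, cy = compress(x, 2), compress(y, 2)
    print(cx, cy, edit_distance(cx, cy))
```

In this toy case both inputs compress to the same string, so their distance after compression is zero even though the raw sequences differ. Information about repeat copy number is discarded, which is what makes the scheme lossy.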
