Repseek, a tool to retrieve approximate repeats from large DNA sequences

The importance of genome redundancy has been strongly emphasized in the field of genome dynamics and evolution as well as in medical biology. A repeat is a sequence present twice or more with a high degree of similarity within a larger sequence (e.g. a chromosome) or set of sequences (e.g. a genome with several chromosomes). Each instance of the repeated sub-sequence is called a "copy" of the repeat. We use the term "duplication" to denote any active mechanistic event that creates a repeat. Even though spurious duplication events (or recombination events between repeats) can cause severe disorders [26, 24], repeated elements nonetheless remain a very important driving force of genome evolution [28]. In that respect, the dynamics and evolution of these redundant sequences have been studied in bacterial genomes [31, 32, 5] as well as in eukaryotic genomes [3, 4, 38]. Duplication events can sometimes copy entire coding regions, giving birth to what are often referred to as duplicate genes. These duplicate genes are the raw material leading to the emergence of novel functions and have been extensively studied (for a historical review see [37]).
Although the repeats we are interested in encompass many known biological repeated elements (e.g. transposable elements, duplicated genes, DNA satellites, segmental duplications), our main concern is not to identify specific families of repeats, but to extract repeats on the sole basis of their sequence similarity, without any prior consideration of their biological function. Unlike RepeatMasker [34], we do not search for already well-characterized repeated elements. Furthermore, our primary goal is not to construct families of repeats. This is the objective of dedicated software such as RepeatScout [30] or of clustering algorithms [9, 29], which reconstruct families from pairs of repeats. Of course, our program can be used to feed these clustering algorithms.
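
To make the definitions of a repeat and its copies concrete, the following minimal Python sketch lists every word of a fixed length that occurs at least twice in a sequence, together with the start position of each copy. It is an illustration only and not the Repseek algorithm: it assumes identical copies, whereas the repeats considered here only need a high degree of similarity. The function name `exact_repeats`, the word length `k` and the toy sequence are all hypothetical choices made for this example.

```python
from collections import defaultdict


def exact_repeats(sequence, k):
    """Map each k-letter word occurring at least twice to its start positions.

    Simplifying assumption: copies must be identical, whereas the repeats
    discussed in the text only need a high degree of similarity.
    """
    positions = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)
    # A word counts as a repeat only if it has two or more copies.
    return {word: starts for word, starts in positions.items() if len(starts) >= 2}


toy_chromosome = "ACGTTGCAACGTTGCA"  # hypothetical toy sequence
repeats = exact_repeats(toy_chromosome, k=8)
print(repeats)
```

Each key of the returned dictionary is a repeat and each position in its list is one copy. Pairs of copies of this kind are the sort of input that the clustering approaches mentioned above could group into families.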
While there are some widely accepted methods to detect duplicate genes in a genome (for instance based on the BLAST or FASTA programs), there is no firmly established technique for detecting repeats in large DNA sequences. The detection of repeats is not a trivial problem, and no satisfactory methodology is available apart from recursive local alignment (using dynamic programming) of sequences with themselves [41]. Such algorithms, however, are quadratic in computation time and in memory usage and

[1] Piotr Berman, et al. Alignments without low-scoring regions, 1998, RECOMB '98.

[2] E. Rocha, et al. Associations between inverted repeats and the structural evolution of bacterial genomes, 2003, Genetics.

[3] Enno Ohlebusch, et al. The Enhanced Suffix Array and Its Applications to Genome Analysis, 2002, WABI.

[4] Arnold L. Rosenberg, et al. Rapid identification of repeated patterns in strings, trees and arrays, 1972, STOC.

[5] C. Rodríguez, et al. Repeated sequences in bacterial chromosomes and plasmids: a glimpse from sequenced genomes, 1999, Research in microbiology.

[6] Robert Giegerich, et al. Efficient implementation of lazy suffix trees, 2003, Softw. Pract. Exp.

[7] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[8] Arnaud Lefebvre, et al. FORRepeats: detects repeats on entire chromosomes and between genomes, 2003, Bioinform.

[9] S. Karlin, et al. Applications and statistics for multiple high-scoring segments in molecular sequences, 1993, Proceedings of the National Academy of Sciences of the United States of America.

[10] E. Coissac, et al. A comparative study of duplications in bacteria and eukaryotes: the importance of telomeres, 1997, Molecular biology and evolution.

[11] S. Eddy, et al. Automated de novo identification of repeat sequence families in sequenced genomes, 2002, Genome research.

[12] R. Mazzarella, et al. Duplication and distribution of repetitive elements and non-unique regions in the human genome, 1997, Gene.

[13] Susumu Ohno. Evolution by Gene Duplication, 1970, Springer Berlin Heidelberg.

[14] Eugene W. Myers, et al. PILER: identification and classification of genomic repeats, 2005.

[15] J. Lupski. Genomic disorders: structural features of the genome can lead to DNA rearrangements and human disease traits, 1998, Trends in genetics: TIG.

[16] Jeroen Raes, et al. Duplication and divergence: the evolution of new genes and old ideas, 2004, Annual review of genetics.

[17] Alain Viari, et al. Searching for flexible repeated patterns using a non-transitive similarity relation, 1995, Pattern Recognit. Lett.

[18] M. Waterman, et al. A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons, 1987, Journal of molecular biology.

[19] Srinivas Aluru, et al. Space efficient linear time construction of suffix arrays, 2005, J. Discrete Algorithms.

[20] Pavel A. Pevzner, et al. De novo identification of repeat families in large genomes, 2005, ISMB.

[21] Henry Huang, et al. Homologous recombination in Escherichia coli: dependence on substrate length and homology, 1986, Genetics.

[22] M. S. Waterman, et al. Identification of common molecular subsequences, 1981, Journal of molecular biology.

[23] G. Achaz, et al. Analysis of intrachromosomal duplications in yeast Saccharomyces cerevisiae: a possible model for their origin, 2000, Molecular biology and evolution.

[24] G. Achaz, et al. Study of intrachromosomal duplications among the eukaryote genomes, 2001, Molecular biology and evolution.

[25] Stefan Kurtz, et al. REPuter: fast computation of maximal repeats in complete genomes, 1999, Bioinform.

[26] Dong Kyue Kim, et al. Linear-Time Construction of Suffix Arrays, 2003, CPM.

[27] Philippe Chrétienne, et al. An Algorithm for Finding a Common Structure Shared by a Family of Strings, 1989, IEEE Trans. Pattern Anal. Mach. Intell.

[28] Maxime Crochemore, et al. Factor Oracle: A New Structure for Pattern Matching, 1999, SOFSEM.

[29] B. Haas, et al. A clustering method for repeat analysis in DNA sequences, 2001, Genome Biology.

[30] Peter Sanders, et al. Simple Linear Work Suffix Array Construction, 2003, ICALP.

[31] M. S. Waterman, et al. Rapid and accurate estimates of statistical significance for sequence data base searches, 1994, Proceedings of the National Academy of Sciences of the United States of America.

[32] Eric Coissac, et al. Origin and fate of repeats in bacteria, 2002, Nucleic Acids Res.

[33] P. Pevzner, et al. De Novo Repeat Classification and Fragment Assembly, 2004.

[34] Serge A. Hazout, et al. A strategy for finding regions of similarity in complete genome sequences, 1998, Bioinform.

[35] Enno Ohlebusch, et al. Computation and Visualization of Degenerate Repeats in Complete Genomes, 2000, ISMB.

[36] Alfred V. Aho, et al. The Design and Analysis of Computer Algorithms, 1974.

[37] S. B. Needleman, et al. A general method applicable to the search for similarities in the amino acid sequence of two proteins, 1970, Journal of molecular biology.

[38] A. Danchin, et al. Analysis of long repeats in bacterial genomes reveals alternative evolutionary mechanisms in Bacillus subtilis and other competent prokaryotes, 1999, Molecular biology and evolution.

[39] Nathan Srebro, et al. Distribution of short paired duplications in mammalian genomes, 2004, Proceedings of the National Academy of Sciences of the United States of America.

[40] Eugene W. Myers, et al. Suffix arrays: a new method for on-line string searches, 1993, SODA '90.