FORRepeats: detects repeats on entire chromosomes and between genomes

MOTIVATION: As more whole genomes become available, new methods are needed to compare large sequences and to transfer biological knowledge from annotated genomes to related new ones. BLAST is not suited to comparing multimegabase DNA sequences, and MegaBLAST is designed to compare closely related large sequences. Some tools for detecting repeats in large sequences, such as MUMmer and REPuter, have already been developed, but they too have time or space restrictions. Moreover, in terms of applications, REPuter only computes repeats, and MUMmer works better with related genomes.

RESULTS: We present a heuristic method, named FORRepeats, which is based on a novel data structure called the factor oracle. In the first step, it detects exact repeats in large sequences. In the second step, it computes approximate repeats and performs pairwise comparison. We compared its computational characteristics with those of BLAST and REPuter. The results demonstrate that it is fast and space-economical. We show FORRepeats' ability to perform intra-genomic comparison and to detect repeated DNA sequences in the complete genome of the model plant Arabidopsis thaliana.
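
To illustrate the data structure named above, here is a minimal Python sketch of the standard online factor oracle construction. It is not the FORRepeats implementation: the function names `build_factor_oracle` and `accepts` are hypothetical, and the repeat detection and approximate matching steps of the method are not shown. The sketch only demonstrates the property that makes the structure attractive for large sequences: it has one state per position plus an initial state, and it accepts every factor (substring) of the input, at the cost of occasionally accepting a word that is not a factor.

```python
def build_factor_oracle(s):
    """Build the factor oracle of s online, one character at a time.

    Returns (trans, supply), where trans[k] maps a character to the
    target state of the transition leaving state k, and supply[k] is
    the supply (failure) link of state k. States are 0..len(s).
    """
    n = len(s)
    trans = [dict() for _ in range(n + 1)]
    supply = [-1] * (n + 1)  # the initial state has no supply link
    for i, c in enumerate(s):
        new = i + 1
        trans[i][c] = new  # internal transition i -> i+1
        k = supply[i]
        # Follow supply links, adding external transitions on c
        # from every visited state that lacks one.
        while k > -1 and c not in trans[k]:
            trans[k][c] = new
            k = supply[k]
        supply[new] = 0 if k == -1 else trans[k][c]
    return trans, supply


def accepts(trans, word):
    """Return True if word labels a path from the initial state."""
    state = 0
    for c in word:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


trans, supply = build_factor_oracle("abbbaab")
print(accepts(trans, "bba"))  # "bba" is a factor of "abbbaab"
print(accepts(trans, "aba"))  # accepted although "aba" is not a factor
```

Because the oracle can accept words that are not factors, any candidate repeat read from it has to be checked against the sequence itself, which is consistent with the abstract describing the method as heuristic.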
