De novo identification of repeat families in large genomes

MOTIVATION De novo repeat family identification is a challenging algorithmic problem of great practical importance. As the number of genome sequencing projects increases, there is a pressing need to identify the repeat families present in large, newly sequenced genomes. We develop a new method for de novo identification of repeat families via extension of consensus seeds; our method enables a rigorous definition of repeat boundaries, a key issue in repeat analysis. RESULTS Our RepeatScout algorithm is more sensitive and is orders of magnitude faster than RECON, the dominant tool for de novo repeat family identification in newly sequenced genomes. Using RepeatScout, we estimate that approximately 2% of the human genome and 4% of mouse and rat genomes consist of previously unannotated repetitive sequence. AVAILABILITY Source code is available for download at http://www-cse.ucsd.edu/groups/bioinformatics/software.html

[1]  Benjamin J. Raphael,et al.  A novel method for multiple alignment of sequences with repeated and shuffled elements. , 2004, Genome research.

[2]  Pankaj Agarwal,et al.  The Repeat Pattern Toolkit (RPT): Analyzing the Structure and Evolution of the C. elegans Genome , 1994, ISMB.

[3]  J. Jurka Repbase update: a database and an electronic journal of repetitive elements. , 2000, Trends in genetics : TIG.

[4]  P. Pevzner,et al.  De Novo Repeat Classification and Fragment Assembly , 2004 .

[5]  P. Pevzner,et al.  An Eulerian path approach to DNA fragment assembly , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[6]  Eugene W. Myers,et al.  PILER: identification and classification of genomic repeats , 2005, ISMB.

[7]  J. V. Moran,et al.  Initial sequencing and analysis of the human genome. , 2001, Nature.

[8]  G. Benson,et al.  Tandem repeats finder: a program to analyze DNA sequences. , 1999, Nucleic acids research.

[9]  P. Pevzner,et al.  Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse, and rat genomes. , 2004, Genome research.

[10]  H. Kazazian Mobile Elements: Drivers of Genome Evolution , 2004, Science.

[11]  Eugene W. Myers,et al.  A Four Russians algorithm for regular expression pattern matching , 1992, JACM.

[12]  S. Eddy,et al.  Automated de novo identification of repeat sequence families in sequenced genomes. , 2002, Genome research.

[13]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[14]  John C. Wootton,et al.  Statistics of Local Complexity in Amino Acid Sequences and Sequence Databases , 1993, Comput. Chem..

[15]  Bin Ma,et al.  Finding similar regions in many strings , 1999, STOC '99.

[16]  Chun Jimmie Ye,et al.  Orthologous Repeats and Mammalian Phylogenetic Inference , 2005 .

[17]  Enno Ohlebusch,et al.  Computation and Visualization of Degenerate Repeats in Complete Genomes , 2000, ISMB.

[18]  P. Pevzner,et al.  Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. , 2004, Genome research.

[19]  Stefan Kurtz,et al.  REPuter: fast computation of maximal repeats in complete genomes , 1999, Bioinform..

[20]  R. Durbin,et al.  The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics , 2003, PLoS biology.

[21]  Lisa M. D'Souza,et al.  Genome sequence of the Brown Norway rat yields insights into mammalian evolution , 2004, Nature.

[22]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[23]  B. Haas,et al.  A clustering method for repeat analysis in DNA sequences , 2001, Genome Biology.

[24]  J. Jurka,et al.  Repeats in genomic DNA: mining and meaning. , 1998, Current opinion in structural biology.