BMC Bioinformatics (BioMed Central), Methodology article

ReRep: Computational detection of repetitive sequences in genome survey sequences (GSS)

Background
Genome survey sequences (GSS) offer a preliminary global view of a genome since, unlike ESTs, they cover coding as well as non-coding DNA and include repetitive regions of the genome. A more precise estimate of the nature, quantity and variability of repetitive sequences very early in a genome sequencing project is of considerable importance, as such data strongly influence the estimation of genome coverage, library quality and progress in scaffold construction. Eliminating repetitive sequences from the initial assembly process is also important to avoid errors and unnecessary complexity. In addition, repetitive sequences are of interest in a variety of other studies, for instance as molecular markers.

Results
We designed and implemented a straightforward pipeline called ReRep, which combines bioinformatics tools for identifying repetitive structures in a GSS dataset. In a case study, we first applied the pipeline to a set of 970 GSSs, sequenced in our laboratory from the human pathogen Leishmania braziliensis, the causative agent of leishmaniosis, an important public health problem in Brazil. We also verified the applicability of ReRep to new sequencing technologies using a set of 454 reads from Escherichia coli. We evaluated the behaviour of several parameters of the algorithm and make suggestions for tuning the analysis.

Conclusion
The ReRep approach for identification of repetitive elements in GSS datasets proved to be straightforward and efficient. Several potential repetitive sequences were found in an L. braziliensis GSS dataset generated in our laboratory, and were further validated by the analysis of a more complete genomic dataset from the EMBL and Sanger Centre databases. ReRep also identified most of the E. coli K12 repeats prior to assembly in an example dataset obtained by automated sequencing using 454 technology. The parameters controlling the algorithm behaved consistently and may be tuned to the properties of the dataset, in particular to the length of sequencing reads and the genome coverage. ReRep is freely available for academic use at http://bioinfo.pdtis.fiocruz.br/ReRep/.
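Because the abstract ties parameter tuning to read length and genome coverage, a short worked example of how coverage is usually estimated may help. The sketch below is not part of ReRep. It uses the standard textbook relation c = N * L / G and the Poisson approximation 1 - exp(-c) for the fraction of the genome touched by at least one read. The read count of 970 comes from the abstract; the mean read length (500 bp) and genome size (32 Mb) are illustrative assumptions, and the function names are hypothetical.

```python
import math


def expected_coverage(n_reads: int, mean_read_len: float, genome_size: float) -> float:
    """Average sequencing depth c = N * L / G."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * mean_read_len / genome_size


def fraction_covered(coverage: float) -> float:
    """Expected fraction of the genome hit by at least one read.

    Poisson approximation 1 - exp(-c); assumes reads are sampled
    uniformly at random, which repetitive regions can violate.
    """
    return 1.0 - math.exp(-coverage)


# Illustrative values only: 970 reads is from the abstract; the read
# length and genome size are assumptions made for this sketch.
c = expected_coverage(n_reads=970, mean_read_len=500, genome_size=32_000_000)
print(f"coverage c = {c:.4f}")
print(f"fraction of genome covered = {fraction_covered(c):.4f}")
```

At such low coverage, two reads overlapping by chance is rare under the uniform model, so reads that do overlap one another are plausible candidates for repeated sequence. This is a general observation about survey data and not a description of ReRep's internal logic.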
