LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons

Background: Transposable elements are abundant in eukaryotic genomes, and they are believed to have a significant impact on the evolution of gene and chromosome structure. Although several eukaryotic genome projects have been completed, only a few high-quality genome-wide annotations of transposable elements exist. There is therefore considerable demand for computational identification of transposable elements. LTR retrotransposons, an important subclass of transposable elements, are well suited for computational identification because they contain long terminal repeats (LTRs).

Results: We have developed a software tool, LTRharvest, for the de novo detection of full-length LTR retrotransposons in large sequence sets. LTRharvest efficiently delivers high-quality annotations based on known LTR retrotransposon features such as length, distance, and sequence motifs. A quality validation of LTRharvest against a gold standard annotation for Saccharomyces cerevisiae and Drosophila melanogaster shows a sensitivity of up to 90% and 97% and a specificity of 100% and 72%, respectively. This is comparable to or slightly better than the annotations produced by previous software tools. The main advantages of LTRharvest over previous tools are (a) its ability to efficiently handle large datasets from finished or unfinished genome projects, (b) its flexibility in incorporating known sequence features into the prediction, and (c) its availability as open source software.

Conclusion: LTRharvest is an efficient software tool that delivers high-quality annotation of LTR retrotransposons. It can, for example, process the largest human chromosome in approximately 8 minutes on a Linux PC with 4 GB of memory. Its flexibility and its small space and run-time requirements make LTRharvest a very competitive candidate for future LTR retrotransposon annotation projects. Moreover, its structured design and implementation and its availability as open source provide an excellent base for incorporating novel concepts to further improve the prediction of LTR retrotransposons.
