Transposome: a toolkit for annotation of transposable element families from unassembled sequence reads

MOTIVATION: Transposable elements (TEs) are found in virtually all eukaryotic genomes and have the potential to produce evolutionary novelty. Despite their broad taxonomic distribution, the evolutionary history of these sequences is largely unknown for many taxa because genomic resources and identification methods are lacking. Most TE annotation methods are designed to work on genome assemblies, so we sought to develop a method that provides a fine-grained classification of TEs directly from DNA sequence reads. Here, we present a toolkit for the efficient annotation of TE families from low-coverage whole-genome shotgun (WGS) data, enabling the rapid identification of TEs in a large number of taxa. We compared our software, Transposome, with other approaches for annotating repeats from WGS data, and we show that it offers significant improvements in run time and produces more precise estimates of genomic repeat abundance. Transposome may also be used as a general toolkit for working with next-generation sequencing (NGS) data and for constructing custom genome analysis pipelines.

AVAILABILITY AND IMPLEMENTATION: The source code for Transposome is freely available (http://sestaton.github.io/Transposome). It is implemented in Perl and is supported on Linux.
