REPdenovo: Inferring De Novo Repeat Motifs from Short Sequence Reads

Repeat elements are important components of eukaryotic genomes. One limitation in our understanding of repeat elements is that most analyses rely on reference genomes that are incomplete and often have missing data in highly repetitive regions that are difficult to assemble. To overcome this problem, we develop a new method, REPdenovo, which assembles repeat sequences directly from raw shotgun sequencing data. REPdenovo can construct various types of repeats that are highly repetitive and have low sequence divergence among copies. We show that REPdenovo is substantially better than existing methods in terms of both the number and the completeness of the repeat sequences that it recovers. The key advantage of REPdenovo is that it can reconstruct long repeats from sequence reads. We apply the method to human data and discover a number of potentially new repeat sequences that have been missed by previous repeat annotations. Many of these sequences are incorporated into various parasite genomes, possibly because the filtering process for host DNA involved in the sequencing of the parasite genomes failed to exclude the host-derived repeat sequences. REPdenovo is a powerful new computational tool for annotating genomes and for addressing questions regarding the evolution of repeat families. REPdenovo is available for download at https://github.com/Reedwarbler/REPdenovo.
