RepLong: de novo repeat identification using long read sequencing data

Motivation: The identification of repetitive elements is important in genome assembly and phylogenetic analyses. Existing de novo repeat identification methods that rely on short reads are ineffective at identifying long repeats. Because long reads are more likely to span repeat regions completely, they are better suited to recognizing long repeats.

Results: In this study, we propose RepLong, a novel de novo repeat identification method based on PacBio long reads. Because reads that map to repeat regions overlap heavily with one another, identifying repeat elements is equivalent to discovering consensus overlaps between reads, which can in turn be cast as a community detection problem in the network of read overlaps. RepLong first constructs a network of read overlaps from pairwise alignment of the reads, where each vertex represents a read and each edge represents a substantial overlap between the corresponding two reads. Second, communities whose intra-community connectivity is greater than their inter-community connectivity are extracted by network modularity optimization. Finally, representative reads in each community are extracted to form the repeat library. Comparison studies on Drosophila melanogaster and human long-read sequencing data against genome-based and short-read-based methods demonstrate the efficiency of RepLong in identifying long repeats. RepLong can handle lower-coverage data and can serve as a complementary solution to existing methods, improving repeat identification performance on long-read sequencing data.

Availability and implementation: The RepLong software is freely available at https://github.com/ruiguo-bio/replong.

Contact: ywsun@szu.edu.cn or zhuzx@szu.edu.cn.

Supplementary information: Supplementary data are available at Bioinformatics online.
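The three steps described in the abstract (overlap network, modularity-based community extraction, representative selection) can be illustrated with a small Python sketch. This is not the RepLong implementation: the function names, the `min_overlap` threshold, the toy overlap tuples, and the rule for choosing a representative (the read with the most overlaps inside its community) are all assumptions made for illustration. A naive greedy agglomerative merge also stands in for whatever modularity optimizer the tool actually uses.

```python
from collections import defaultdict
from itertools import combinations


def build_overlap_graph(overlaps, min_overlap=1000):
    """Build an undirected read-overlap graph.

    `overlaps` is an iterable of (read_a, read_b, overlap_length) tuples,
    e.g. parsed from a pairwise aligner's output (format assumed here).
    Only overlaps of at least `min_overlap` bases become edges.
    """
    graph = defaultdict(set)
    for a, b, length in overlaps:
        if a != b and length >= min_overlap:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def modularity(graph, communities):
    """Newman modularity Q of a partition of an unweighted graph."""
    m = sum(len(nbrs) for nbrs in graph.values()) / 2  # number of edges
    if m == 0:
        return 0.0
    q = 0.0
    for members in communities:
        # Each internal edge is seen from both endpoints, hence the / 2.
        internal = sum(1 for u in members for v in graph[u] if v in members) / 2
        degree = sum(len(graph[u]) for u in members)
        q += internal / m - (degree / (2 * m)) ** 2
    return q


def greedy_communities(graph):
    """Merge the pair of communities that most increases Q until no merge helps."""
    communities = [{node} for node in graph]
    best_q = modularity(graph, communities)
    while len(communities) > 1:
        best_merge = None
        for i, j in combinations(range(len(communities)), 2):
            trial = [c for k, c in enumerate(communities) if k not in (i, j)]
            trial.append(communities[i] | communities[j])
            q = modularity(graph, trial)
            if q > best_q + 1e-12:  # require a strict improvement
                best_q, best_merge = q, trial
        if best_merge is None:
            break
        communities = best_merge
    return communities, best_q


def representative_reads(graph, communities):
    """Pick one read per community: the one with the most intra-community
    overlaps (an assumed rule), ties broken by read name."""
    reps = []
    for members in communities:
        reps.append(max(sorted(members), key=lambda r: len(graph[r] & members)))
    return sorted(reps)


# Toy data: two groups of mutually overlapping reads joined by one overlap,
# plus one short overlap that falls below the threshold.
overlaps = [
    ("r1", "r2", 3000), ("r1", "r3", 2500), ("r2", "r3", 4000),
    ("r4", "r5", 3500), ("r4", "r6", 2800), ("r5", "r6", 3100),
    ("r3", "r4", 1200), ("r1", "r6", 200),
]
graph = build_overlap_graph(overlaps)
communities, q = greedy_communities(graph)
library = representative_reads(graph, communities)
```

The all-pairs merge search is far too slow for a real read set and is used only to keep the sketch short. On real data a scalable method such as the one in "Fast unfolding of communities in large networks" would be a more realistic choice.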
