Konnector v2.0: pseudo-long reads from paired-end sequencing data

Background: Reading the nucleotides from the two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment is longer than the combined read length, a gap of unsequenced nucleotides remains between the two reads of a pair. If the target in such an experiment is sequenced deeply enough to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local de novo assembly tool that addresses this problem. Here we report on version 2.0 of our tool.

Results: Konnector uses a probabilistic and memory-efficient data structure called a Bloom filter to represent a k-mer spectrum, that is, all sequences of length k that occur in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups against this data structure to construct an implicit de Bruijn graph, which describes the (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences.

Conclusions: Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. Because it represents each k-mer with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.
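The idea of a Bloom filter k-mer spectrum queried as an implicit de Bruijn graph can be illustrated with a minimal Python sketch. This is not Konnector's implementation: the class and function names (`BloomFilter`, `build_filter`, `successors`, `connect`), the hash scheme, the filter size, and the use of a plain one-directional breadth-first search are all assumptions made for illustration, and the sketch ignores reverse complements and sequencing errors.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Toy Bloom filter over strings (illustrative, not Konnector's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def build_filter(reads, k, num_bits=1 << 20, num_hashes=3):
    """Load every k-mer of every read into a Bloom filter (the k-mer spectrum)."""
    bf = BloomFilter(num_bits, num_hashes)
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])
    return bf


def successors(bf, kmer):
    """Implicit de Bruijn edges: k-mers in the filter overlapping kmer by k-1 bases."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in bf]


def connect(bf, left_read, right_read, k, max_steps=1000):
    """Breadth-first search from the last k-mer of left_read to the first of right_read."""
    start, goal = left_read[-k:], right_read[:k]
    queue = deque([(start, left_read)])
    seen = {start}
    while queue:
        kmer, path = queue.popleft()
        if kmer == goal:
            return path + right_read[k:]  # pseudo-long read spanning the gap
        if len(path) - len(left_read) >= max_steps:
            continue  # give up on paths longer than the search budget
        for nxt in successors(bf, kmer):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + nxt[-1]))
    return None  # no connecting path found


# Hypothetical toy data: three overlapping reads covering one fragment.
fragment = "ATGGCGTACCTTGAACGCTAGT"
coverage_reads = [fragment[0:12], fragment[6:18], fragment[10:22]]
bf = build_filter(coverage_reads, k=5)
print(connect(bf, fragment[:7], fragment[-7:], k=5))
```

The graph is never stored explicitly: each neighbour query is four Bloom filter look-ups, which is why memory depends only on the filter size. Sizing `num_bits` at roughly 12 bits per distinct k-mer would correspond to the 1.5 bytes per k-mer figure quoted above.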
