Using disk based index and box queries for genome sequencing error correction

The vast increase in DNA sequencing capacity over the last decade has quickly turned biology into a data-intensive science. Nevertheless, current sequencers such as the Illumina HiSeq have high random per-base error rates, which makes sequencing error correction indispensable for many sequence analysis applications. Most existing error correction methods require large amounts of expensive main memory, which limits their scalability to large datasets. In this paper, we present a new disk-based method, called DiskBQcor, for sequencing error correction. DiskBQcor stores the k-mers of genome sequencing data, along with their associated metadata, on inexpensive disk and uses a disk-based index tree to efficiently process special box queries that retrieve relevant k-mers and their occurrence frequencies. It then applies a comprehensive voting mechanism and, where needed, an efficient binary-encoding-based assembly technique to verify and correct an erroneous base in a genome sequence under various conditions. Our experiments demonstrate that the proposed method is quite promising for error verification and correction of genome sequencing data on disk.

Keywords: DNA sequencing, error correction, index tree, box query, algorithm.
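To make the box-query and voting idea concrete, the following is a minimal sketch, not DiskBQcor itself. It assumes that a box query specifies, for each position of a k-mer, the set of allowed bases, and that a suspect base is checked by leaving its position unconstrained while fixing all other positions. An in-memory `Counter` stands in for the on-disk index tree, and a simple frequency majority stands in for the paper's voting mechanism; the names `count_kmers`, `box_query`, and `vote_base` are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_kmers(reads, k):
    """Count every k-mer in the reads (in-memory stand-in for the disk index)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def box_query(counts, box):
    """Return {kmer: frequency} for k-mers whose base at each position
    belongs to the allowed set box[position].

    This is a linear scan; an index tree would prune the search instead.
    """
    return {
        kmer: freq
        for kmer, freq in counts.items()
        if len(kmer) == len(box)
        and all(base in allowed for base, allowed in zip(kmer, box))
    }


def vote_base(counts, kmer, pos):
    """Pick the most frequent base at `pos` among k-mers that agree with
    `kmer` everywhere else (assumed simplification of the voting step)."""
    # Box: any base at the suspect position, the observed base elsewhere.
    box = [set(ALPHABET) if i == pos else {base} for i, base in enumerate(kmer)]
    votes = Counter()
    for candidate, freq in box_query(counts, box).items():
        votes[candidate[pos]] += freq
    if not votes:
        return kmer[pos]  # no evidence: keep the observed base
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Five error-free reads and one read with a substitution (T -> A).
    reads = ["ACGTAC"] * 5 + ["ACGAAC"]
    counts = count_kmers(reads, k=4)
    print(vote_base(counts, "ACGA", pos=3))
```

In this toy input the suspect k-mer `ACGA` occurs once while `ACGT` occurs five times, so the vote favors `T` at the last position. A real implementation would replace the linear scan with index-tree traversal on disk and would weigh evidence from several overlapping k-mers.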
