Non Hybrid Long Read Consensus Using Local De Bruijn Graph Assembly

While second generation sequencing led to a vast increase in sequenced data, the shorter reads which came with it made assembly a much harder task and for some regions impossible with only short read data. This changed again with the advent of third generation long read sequencers. The length of the long reads allows a much better resolution of repetitive regions, their high error rate however is a major challenge. Using the data successfully requires to remove most of the sequencing errors. The first hybrid correction methods used low noise second generation data to correct third generation data, but this approach has issues when it is unclear where to place the short reads due to repeats and also because second generation sequencers fail to sequence some regions which third generation sequencers work on. Later non hybrid methods appeared. We present a new method for non hybrid long read error correction based on De Bruijn graph assembly of short windows of long reads with subsequent combination of these correct windows to corrected long reads. Our experiments show that this method yields a better correction than other state of the art non hybrid correction approaches.

[1]  Stephen M. Mount,et al.  The genome sequence of Drosophila melanogaster. , 2000, Science.

[2]  Eugene W. Myers,et al.  Efficient Local Alignment Discovery amongst Noisy Long Reads , 2014, WABI.

[3]  Timothy P. L. Smith,et al.  Reducing assembly complexity of microbial genomes with single-molecule sequencing , 2013, Genome Biology.

[4]  Timothy B. Stockwell,et al.  The Sequence of the Human Genome , 2001, Science.

[5]  Steven J. M. Jones,et al.  Abyss: a Parallel Assembler for Short Read Sequence Data Material Supplemental Open Access , 2022 .

[6]  Esko Ukkonen,et al.  Accurate self-correction of errors in long reads using de Bruijn graphs , 2016, Bioinform..

[7]  Michael S. Waterman,et al.  A New Algorithm for DNA Sequence Assembly , 1995, J. Comput. Biol..

[8]  Pavel A. Pevzner,et al.  Assembly of long error-prone reads using de Bruijn graphs , 2016, Proceedings of the National Academy of Sciences.

[9]  Jian Wang,et al.  SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler , 2012, GigaScience.

[10]  Thomas Hackl,et al.  proovread: large-scale high-accuracy PacBio correction through iterative short read consensus , 2014, Bioinform..

[11]  Kiyoshi Asai,et al.  PBSIM: PacBio reads simulator - toward accurate genome assembly , 2013, Bioinform..

[12]  M. Schatz,et al.  Phased diploid genome assembly with single-molecule real-time sequencing , 2016, Nature Methods.

[13]  P. Pevzner,et al.  An Eulerian path approach to DNA fragment assembly , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[14]  R. Jewkes,et al.  Perceptions and Experiences of Research Participants on Gender-Based Violence Community Based Survey: Implications for Ethical Guidelines , 2012, PloS one.

[15]  Leena Salmela,et al.  LoRDEC: accurate and efficient long read error correction , 2014, Bioinform..

[16]  J. Landolin,et al.  Assembling large genomes with single-molecule sequencing and locality-sensitive hashing , 2014, Nature Biotechnology.

[17]  S. Koren,et al.  Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation , 2016, bioRxiv.

[18]  Aaron A. Klammer,et al.  Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data , 2013, Nature Methods.

[19]  Piet Demeester,et al.  Jabba: hybrid error correction for long sequencing reads , 2015, Algorithms for Molecular Biology.

[20]  Johannes Fischer,et al.  Optimal Succinctness for Range Minimum Queries , 2008, LATIN.

[21]  J. V. Moran,et al.  Initial sequencing and analysis of the human genome. , 2001, Nature.

[22]  W. Wong,et al.  Improving PacBio Long Read Accuracy by Short Read Alignment , 2012, PloS one.

[23]  D. Mccormick Sequence the Human Genome , 1986, Bio/Technology.

[24]  Simon J. Puglisi,et al.  Range Quantile Queries: Another Virtue of Wavelet Trees , 2009, SPIRE.