DBG2OLC: Efficient Assembly of Large Genomes Using the Compressed Overlap Graph

The highly anticipated transition to third-generation DNA sequencing (3rdGS) technology has reached a stalemate, primarily because of high error rates (15-45%). These make assembling long, erroneous reads extremely challenging, and existing software solutions for 3rdGS assembly are often overwhelmed by error-correction tasks. We report three significant breakthroughs that push the envelope of genome assembly and offer an enabling software solution to the current 3rdGS stalemate. First, we take a counter-intuitive strategy and develop an assembly algorithm that requires no base-level error correction: it relies on data compression, and assembly is performed on the compressed reads. Orders of magnitude of compression lead to orders-of-magnitude reductions in read length, and in turn to orders-of-magnitude savings in computational time and memory. We implement the new algorithm in a proof-of-concept software package, DBG2OLC. Experiments with 3rdGS data, including PacBio and Oxford Nanopore, show that our method assembles large genomes orders of magnitude more efficiently than existing methods. For example, on a large PacBio human genome dataset we computed the all-pair alignment of 54x erroneous long reads in 6 hours, compared with the 405,000 CPU hours previously reported by Pacific Biosciences. Second, while maintaining comparably high assembly quality, our approach requires significantly lower sequencing coverage (10x-20x) than existing assemblers, which translates into significant cost savings for genome sequencing. Third, our method is highly adaptive and is the only one to date that demonstrates very high efficiency not only on 3rdGS PacBio and Nanopore sequences but also on the latest NGS data.
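The abstract does not spell out the compression scheme, so the following Python sketch is only an illustration of the general idea of assembling on compressed reads, not DBG2OLC's actual implementation. It assumes a hypothetical set of trusted k-mers ("anchors"), rewrites each read as the ordered list of anchor IDs it contains, and then finds candidate overlaps by counting shared anchor IDs. The names `compress_read`, `candidate_overlaps`, `anchor_ids`, and `min_shared` are all invented for this example.

```python
from collections import defaultdict
from itertools import combinations


def compress_read(read, anchor_ids, k):
    """Rewrite a read as the ordered IDs of the trusted k-mers it contains.

    Bases that fall outside every anchor are simply dropped, so no
    base-level error correction is attempted (illustrative assumption).
    """
    compressed = []
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in anchor_ids:
            compressed.append(anchor_ids[kmer])
    return compressed


def candidate_overlaps(compressed_reads, min_shared=2):
    """Return {(read_a, read_b): shared_anchor_count} for read pairs that
    share at least `min_shared` distinct anchor IDs."""
    # Inverted index: anchor ID -> set of reads containing it.
    index = defaultdict(set)
    for read_id, compressed in enumerate(compressed_reads):
        for anchor in set(compressed):
            index[anchor].add(read_id)

    # Count shared anchors for each pair that co-occurs in the index.
    shared = defaultdict(int)
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy data: three anchors of length k = 4 and three short reads.
K = 4
ANCHORS = {"ACGT": 0, "GGCA": 1, "TTAG": 2}
READS = ["ACGTAAGGCAC", "GGCATCTTAGA", "ACGTCCGGCAGTTAG"]

COMPRESSED = [compress_read(r, ANCHORS, K) for r in READS]
OVERLAPS = candidate_overlaps(COMPRESSED, min_shared=2)
```

In this toy setting each read shrinks from 11 to 15 bases down to two or three integers, and overlap detection touches only those integers. That is the intuition behind the claimed savings in time and memory; a real assembler would also need to check anchor order and orientation before accepting an overlap.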
