ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter

The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species and between or within individuals, critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically. Coupled with established and planned large-scale personalized medicine initiatives to sequence genomes in the thousands and even millions, this makes the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded an NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM. This is a modest memory requirement by today's standards and is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics' Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.
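
To illustrate the general idea of representing a de Bruijn graph with a Bloom filter, the following is a minimal Python sketch and not the ABySS 2.0 implementation. It assumes a toy read set, a fixed k, and BLAKE2b-based hashing; the names `BloomFilter`, `kmers`, and `successors` are hypothetical. For simplicity it ignores reverse complements and does not attempt to correct for false positives, both of which a real assembler must handle.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter over strings (not the ABySS data structure)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        # May report false positives, but never false negatives.
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for simplicity)."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def successors(bf, kmer):
    """Query the implicit graph: which of the four possible next k-mers are present?"""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bf]


# Toy usage: the graph's vertices are the k-mers stored in the filter;
# edges are implied by (k-1)-base overlaps and found by membership queries.
k = 4
reads = ["ACGTACGGA", "GTACGGAT"]  # hypothetical reads
bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
for read in reads:
    for km in kmers(read, k):
        bf.add(km)

print(successors(bf, "ACGT"))  # contains "CGTA"; may contain false positives
```

Because only the bit array is stored, memory use depends on the filter size and not on explicit k-mer storage, which is the trade-off the abstract refers to: a small false-positive rate in exchange for a much smaller footprint.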
