Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

AbstractBackgroundData volumes generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as the widely used gzip method.ResultsWe present a novel reference-free method meant to compress data issued from high throughput sequencing technologies. Our approach, implemented in the software Leon, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn Graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring kmer and a list of bifurcations. The same probabilistic de Bruijn Graph is used to perform a lossy transformation of the quality scores, which allows to obtain higher compression rates without losing pertinent information for downstream analyses.ConclusionsLeon was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20. Leon is an open source software, distributed under GNU affero GPL License, available for download at http://gatb.inria.fr/software/leon/.

[1]  Giovanna Rosone,et al.  Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform , 2012, Bioinform..

[2]  Peter Deutsch,et al.  ZLIB Compressed Data Format Specification version 3.3 , 1996, RFC.

[3]  Michael Mitzenmacher,et al.  Less hashing, same performance: Building a better Bloom filter , 2006, Random Struct. Algorithms.

[4]  Paul Medvedev,et al.  Informed and automated k-mer size selection for genome assembly , 2013, Bioinform..

[5]  Ian H. Witten,et al.  Arithmetic coding for data compression , 1987, CACM.

[6]  Bonnie Berger,et al.  Quality score compression improves genotyping accuracy , 2015, Nature Biotechnology.

[7]  Kiyoshi Asai,et al.  Transformations for the compression of FASTQ quality scores of next-generation sequencing data , 2012, Bioinform..

[8]  Rayan Chikhi,et al.  Reference-free detection of isolated SNPs , 2014, Nucleic acids research.

[9]  Gregory Kucherov,et al.  Using cascading Bloom filters to improve the memory usage for de Brujin graphs , 2013, Algorithms for Molecular Biology.

[10]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[11]  Dominique Lavenier,et al.  GATB: Genome Assembly & Analysis Tool Box , 2014, Bioinform..

[12]  Robert Patro,et al.  Reference-based compression of short-read sequences using path encoding , 2015, Bioinform..

[13]  Markus Hsi-Yang Fritz,et al.  Efficient storage of high throughput DNA sequencing data using reference-based compression. , 2011, Genome research.

[14]  Ole Schulz-Trieglaff,et al.  BEETL-fastq: a searchable compressed archive for DNA reads , 2014, Bioinform..

[15]  Dominique Lavenier,et al.  DSK: k-mer counting with very low memory usage , 2013, Bioinform..

[16]  Robert Patro,et al.  Data-dependent bucketing improves reference-free compression of sequencing reads , 2015, Bioinform..

[17]  Arend Hintze,et al.  Scaling metagenome sequence assembly with probabilistic de Bruijn graphs , 2011, Proceedings of the National Academy of Sciences.

[18]  Bonnie Berger,et al.  Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification , 2014, RECOMB.

[19]  Szymon Grabowski,et al.  Compression of DNA sequence reads in FASTQ format , 2011, Bioinform..

[20]  James K. Bonfield,et al.  Compression of FASTQ and SAM Format Sequencing Data , 2013, PloS one.

[21]  Walter L. Ruzzo,et al.  Compression of next-generation sequencing reads aided by highly efficient de novo assembly , 2012, Nucleic acids research.

[22]  Alistair Moffat,et al.  Lossy compression of quality scores in genomic data , 2014, Bioinform..

[23]  Giovanna Rosone,et al.  Adaptive reference-free compression of sequence quality scores , 2014, Bioinform..

[24]  Orit Shefi,et al.  Spatial regulation dominates gene function in the ganglia chain , 2014, Bioinform..

[25]  Hideaki Sugawara,et al.  The Sequence Read Archive , 2010, Nucleic Acids Res..

[26]  Faraz Hach,et al.  SCALCE: boosting sequence compression algorithms using locally consistent encoding , 2012, Bioinform..

[27]  Szymon Grabowski,et al.  Disk-based compression of data from genome sequencing , 2015, Bioinform..

[28]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[29]  Rayan Chikhi,et al.  Space-efficient and exact de Bruijn graph representation based on a Bloom filter , 2012, Algorithms for Molecular Biology.

[30]  G. McVean,et al.  De novo assembly and genotyping of variants using colored de Bruijn graphs , 2011, Nature Genetics.

[31]  Sebastian Deorowicz,et al.  KMC 2: Fast and resource-frugal k-mer counting , 2014, Bioinform..

[32]  Pierre Peterlongo,et al.  Mapping-Free and Assembly-Free Discovery of Inversion Breakpoints from Raw NGS Reads , 2014, AlCoB.