Disk-based genome sequencing data compression

Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions to this problem are disk-based (Yanovsky, 2011; Cox et al., 2012); the better of the two, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gb human genome sequencing collection with almost 45-fold coverage.

Results: We propose ORCOM (Overlapping Reads COmpression with Minimizers), a compression algorithm dedicated to sequencing reads (DNA only). Our method uses the conceptually simple and easily parallelizable idea of minimizers to achieve a compression ratio of 0.317 bits per base, which allows the 134.0 Gb dataset to fit into only 5.31 GB of space.

Availability: this http URL under a free license.
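The minimizer idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the ORCOM implementation: it assumes the simplest definition of a minimizer (the lexicographically smallest k-mer of a read, in the sense of Roberts et al., 2004), and the names `minimizer`, `bucket_by_minimizer`, and `compressed_size_bytes`, as well as the toy reads and the value k = 4, are hypothetical choices made for illustration. The point is that reads overlapping on the same genome region tend to share a minimizer and therefore land in the same bucket, which can be processed independently of the others. The last helper only reproduces the size arithmetic from the abstract.

```python
from collections import defaultdict


def minimizer(read: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of the read (assumed definition)."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_by_minimizer(reads, k):
    """Group reads by their minimizer; each bucket can be handled separately."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


def compressed_size_bytes(bases: float, bits_per_base: float) -> float:
    """Convert a bits-per-base ratio into a total size in bytes."""
    return bases * bits_per_base / 8


# Toy example: the first three reads overlap on the sequence GGTCAAACGTTCGG.
reads = ["GGTCAAACGT", "TCAAACGTTC", "AAACGTTCGG", "TTTTGGGGCC"]
buckets = bucket_by_minimizer(reads, k=4)
for key, group in sorted(buckets.items()):
    print(key, group)

# Size check for the figures quoted in the abstract (134.0 Gb at 0.317 bits per base).
print(round(compressed_size_bytes(134.0e9, 0.317) / 1e9, 2), "GB")
```

In this toy input the three overlapping reads share the minimizer `AAAC` and fall into one bucket, while the unrelated read gets a bucket of its own. The size check gives about 5.31 GB, in line with the figure stated in the abstract.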

[1] Faraz Hach, et al. SCALCE: boosting sequence compression algorithms using locally consistent encoding, 2012, Bioinform.

[2] Hamidreza Chitsaz, et al. De novo co-assembly of bacterial genomes from multiple single cells, 2012, 2012 IEEE International Conference on Bioinformatics and Biomedicine.

[3] Tak Wah Lam, et al. GPU-Accelerated BWT Construction for Large Collection of Short Reads, 2014, ArXiv.

[4] James K. Bonfield, et al. Compression of FASTQ and SAM Format Sequencing Data, 2013, PloS one.

[5] Giovanna Rosone, et al. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform, 2012, Bioinform.

[6] Giovanna Rosone, et al. Lightweight BWT Construction for Very Large String Collections, 2011, CPM.

[7] Yang Li, et al. Memory Efficient Minimum Substring Partitioning, 2013, Proc. VLDB Endow.

[8] Alistair Moffat, et al. Lossy compression of quality scores in genomic data, 2014, Bioinform.

[9] Kiyoshi Asai, et al. Transformations for the compression of FASTQ quality scores of next-generation sequencing data, 2012, Bioinform.

[10] Reducing Whole-Genome Data Storage Footprint, 2012.

[11] Michael Roberts, et al. Reducing storage requirements for biological sequence comparison, 2004, Bioinform.

[12] Paul Medvedev, et al. On the representation of de Bruijn graphs, 2014, RECOMB.

[13] Xin Chen, et al. SRComp: Short Read Sequence Compression Using Burstsort and Elias Omega Coding, 2013, PloS one.

[14] Bonnie Berger, et al. Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification, 2014, RECOMB.

[15] Sebastian Deorowicz, et al. DSRC 2 - Industry-oriented compression of FASTQ files, 2014, Bioinform.

[16] Giovanna Rosone, et al. Adaptive reference-free compression of sequence quality scores, 2014, Bioinform.

[17] Scott D. Kahn. On the Future of Genomic Data, 2011, Science.

[18] Giovanni Motta, et al. Handbook of Data Compression, 2009.

[19] Dmitry A. Shkarin, et al. PPM: one step to practicality, 2002, Proceedings DCC 2002. Data Compression Conference.

[20] Derrick E. Wood, et al. Kraken: ultrafast metagenomic sequence classification using exact alignments, 2014, Genome Biology.

[21] Sebastian Deorowicz, et al. KMC 2: Fast and resource-frugal k-mer counting, 2014, Bioinform.

[22] Szymon Grabowski, et al. Data compression for sequencing data, 2013, Algorithms for Molecular Biology.

[23] Walter L. Ruzzo, et al. Compression of next-generation sequencing reads aided by highly efficient de novo assembly, 2012, Nucleic acids research.

[24] Szymon Grabowski, et al. Compression of DNA sequence reads in FASTQ format, 2011, Bioinform.

[25] Vladimir Yanovsky. ReCoil - an algorithm for compression of extremely large datasets of DNA data, 2010, Algorithms for Molecular Biology.

[26] Sen Zhang, et al. Suffix Array Construction in External Memory Using D-Critical Substrings, 2014, TOIS.