DSRC 2 - Industry-oriented compression of FASTQ files

SUMMARY: Modern sequencing platforms produce huge amounts of data. Archiving them raises major problems but is crucial for the reproducibility of results, one of the most fundamental principles of science. The widely used gzip compressor, applied to reduce storage and transfer costs, is not a perfect solution, so a few specialized FASTQ compressors have been proposed recently. Unfortunately, they are often impractical because of slow processing, lack of support for some variants of FASTQ files, or instability. We propose DSRC 2, which offers compression ratios comparable with the best existing solutions while being a few times faster and more flexible.

AVAILABILITY AND IMPLEMENTATION: DSRC 2 is freely available at http://sun.aei.polsl.pl/dsrc. The package contains a command-line compressor, C++ and Python libraries for easy integration with existing software, and technical documentation with examples of usage.

CONTACT: sebastian.deorowicz@polsl.pl

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
