Compression of FASTQ and SAM Format Sequencing Data

Storage and transmission of the data produced by modern DNA sequencing instruments have become a major concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference-based compression (CRAM, Goby) and non-reference-based compression (DSRC, BAM), and against other recently published competition entries (Quip, SCALCE). The tools are shown to define the new Pareto frontier for FASTQ compression, offering state-of-the-art compression ratios at affordable CPU costs. All programs are freely available on SourceForge. Fastqz: https://sourceforge.net/projects/fastqz/, fqzcomp: https://sourceforge.net/projects/fqzcomp/, and samcomp: https://sourceforge.net/projects/samcomp/.
