GTZ: a fast compression and cloud transmission tool optimized for FASTQ files

BackgroundThe dramatic development of DNA sequencing technology is generating real big data, craving for more storage and bandwidth. To speed up data sharing and bring data to computing resource faster and cheaper, it is necessary to develop a compression tool than can support efficient compression and transmission of sequencing data onto the cloud storage.ResultsThis paper presents GTZ, a compression and transmission tool, optimized for FASTQ files. As a reference-free lossless FASTQ compressor, GTZ treats different lines of FASTQ separately, utilizes adaptive context modelling to estimate their characteristic probabilities, and compresses data blocks with arithmetic coding. GTZ can also be used to compress multiple files or directories at once. Furthermore, as a tool to be used in the cloud computing era, it is capable of saving compressed data locally or transmitting data directly into cloud by choice. We evaluated the performance of GTZ on some diverse FASTQ benchmarks. Results show that in most cases, it outperforms many other tools in terms of the compression ratio, speed and stability.ConclusionsGTZ is a tool that enables efficient lossless FASTQ data compression and simultaneous data transmission onto to cloud. It emerges as a useful tool for NGS data storage and transmission in the cloud environment. GTZ is freely available online at: https://github.com/Genetalks/gtz.

[1]  Szymon Grabowski,et al.  Compression of DNA sequence reads in FASTQ format , 2011, Bioinform..

[2]  Stéphane Grumbach,et al.  Compression of DNA sequences , 1993, [Proceedings] DCC `93: Data Compression Conference.

[3]  James K. Bonfield,et al.  Compression of FASTQ and SAM Format Sequencing Data , 2013, PloS one.

[4]  Emmanuel Jeannot,et al.  Adaptive online data compression , 2002, Proceedings 11th IEEE International Symposium on High Performance Distributed Computing.

[5]  Stéphane Grumbach,et al.  A New Challenge for Compression Algorithms: Genetic Sequences , 1994, Inf. Process. Manag..

[6]  David A. Huffman,et al.  A method for the construction of minimum-redundancy codes , 1952, Proceedings of the IRE.

[7]  Faraz Hach,et al.  SCALCE: boosting sequence compression algorithms using locally consistent encoding , 2012, Bioinform..

[8]  Yanli Yang,et al.  Light-weight reference-based compression of FASTQ data , 2015, BMC Bioinformatics.

[9]  Sanguthevar Rajasekaran,et al.  LFQC: A lossless compression algorithm for FASTQ files , 2019, Bioinform..

[10]  George Varghese,et al.  Compressing Genomic Sequence Fragments Using SlimGene , 2010, RECOMB.

[11]  Sara P. Garcia,et al.  GReEn: a tool for efficient compression of genome resequencing data , 2011, Nucleic acids research.

[12]  Sebastian Deorowicz,et al.  DSRC 2 - Industry-oriented compression of FASTQ files , 2014, Bioinform..

[13]  Xiaohui Xie,et al.  Data structures and compression algorithms for high-throughput sequencing technologies , 2010, BMC Bioinformatics.

[14]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[15]  Walter L. Ruzzo,et al.  Compression of next-generation sequencing reads aided by highly efficient de novo assembly , 2012, Nucleic acids research.