LCTD: A lossless compression tool of FASTQ file based on transformation of original file distribution

In this paper, we propose a reference-free, lossless compression tool for FASTQ, the format commonly used to store next-generation sequencing (NGS) data. Instead of designing elaborate data structures and compression techniques for the original FASTQ file, we change the distribution of the original file so that existing compression tools can compress it more effectively. Experimental results indicate that our method outperforms the six state-of-the-art compression tools we compared against, improving the average compression ratio by 10% to 43%. In addition, our tool, LCTD, outperforms Fastqz, the winner of the SequenceSqueeze compression competition, in both compression ratio and speed. The source program is available from the authors by email.
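The abstract does not specify which transformation LCTD applies, so the following is only an illustrative sketch of the general idea of transforming a FASTQ file before handing it to an existing compressor. It assumes a simple, commonly used transformation (separating headers, bases, and quality scores into three streams) and uses zlib as a stand-in for the existing compression tool. The function names are hypothetical, and the sketch assumes four-line records with a bare "+" separator line and a trailing newline.

```python
import zlib


def split_streams(fastq_text):
    """Split FASTQ text into three streams: headers, bases, qualities.

    Assumption (not from the paper): every record has exactly four lines
    and the third line is a bare "+".
    """
    lines = fastq_text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    headers, bases, quals = [], [], []
    for i in range(0, len(lines), 4):
        if not lines[i].startswith("@") or lines[i + 2] != "+":
            raise ValueError("unsupported FASTQ record at line %d" % (i + 1))
        headers.append(lines[i][1:])   # drop the leading "@"
        bases.append(lines[i + 1])
        quals.append(lines[i + 3])
    return "\n".join(headers), "\n".join(bases), "\n".join(quals)


def join_streams(headers, bases, quals):
    """Inverse of split_streams: rebuild the original FASTQ text."""
    records = zip(headers.split("\n"), bases.split("\n"), quals.split("\n"))
    return "".join("@%s\n%s\n+\n%s\n" % (h, b, q) for h, b, q in records)


def compress_transformed(fastq_text):
    """Compress each stream separately with a general-purpose compressor."""
    return [zlib.compress(s.encode("ascii"), 9) for s in split_streams(fastq_text)]


def decompress_transformed(blobs):
    """Decompress the three streams and restore the FASTQ text losslessly."""
    headers, bases, quals = (zlib.decompress(b).decode("ascii") for b in blobs)
    return join_streams(headers, bases, quals)


if __name__ == "__main__":
    sample = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nHHII\n"
    blobs = compress_transformed(sample)
    # Lossless: the round trip must reproduce the input exactly.
    assert decompress_transformed(blobs) == sample
    baseline = len(zlib.compress(sample.encode("ascii"), 9))
    print("raw zlib bytes:", baseline)
    print("per-stream zlib bytes:", sum(len(b) for b in blobs))
```

Whether the per-stream layout is actually smaller depends on the data; on a tiny sample like the one above, the fixed per-stream overhead can dominate, so any size comparison is only meaningful on realistic files.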
