Reference-free compression of next-generation sequencing data in FASTQ format

In this paper, we present a new reference-free, lossless approach to compressing next-generation sequencing (NGS) data in FASTQ format. The approach splits the input FASTQ data into three parts (metadata, short reads, and quality scores) and eliminates the redundancy in each part independently, according to its own characteristics. Experiments were conducted on five real-world NGS datasets. The results show that the proposed algorithm achieves a higher compression gain than previous state-of-the-art compression algorithms.
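The three-way split described above can be illustrated with a minimal sketch. This is not the proposed algorithm: the function names `split_fastq`, `compress_streams`, and `decompress_streams` are hypothetical, and the general-purpose `zlib` compressor stands in for the stream-specific coders, which the abstract does not detail. The sketch also assumes four-line records whose separator line is a bare `+`.

```python
import zlib


def split_fastq(text):
    """Split FASTQ text into metadata, read, and quality-score streams.

    Assumes 4-line records with a bare '+' separator line (an assumption
    of this sketch, not something stated in the paper).
    """
    lines = text.strip().splitlines()
    if not lines or len(lines) % 4 != 0:
        raise ValueError("expected a non-empty input with 4 lines per record")
    metadata, reads, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or plus != "+":
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"read/quality length mismatch at line {i + 1}")
        metadata.append(header[1:])  # drop the leading '@'
        reads.append(seq)
        quals.append(qual)
    return metadata, reads, quals


def compress_streams(metadata, reads, quals):
    """Compress each stream independently (zlib is only a placeholder)."""
    return tuple(
        zlib.compress("\n".join(stream).encode("ascii"), 9)
        for stream in (metadata, reads, quals)
    )


def decompress_streams(blobs):
    """Rebuild the original FASTQ text from the three compressed streams."""
    metadata, reads, quals = (
        zlib.decompress(blob).decode("ascii").split("\n") for blob in blobs
    )
    return "".join(
        f"@{m}\n{r}\n+\n{q}\n" for m, r, q in zip(metadata, reads, quals)
    )


if __name__ == "__main__":
    sample = "@r1 desc\nACGT\n+\nIIII\n@r2 desc\nTTGA\n+\n!!#I\n"
    blobs = compress_streams(*split_fastq(sample))
    assert decompress_streams(blobs) == sample  # lossless round trip
    print([len(b) for b in blobs])
```

Keeping the streams separate lets each one be handled by a coder suited to its statistics: a small nucleotide alphabet for the reads, highly repetitive field structure for the metadata, and a skewed symbol distribution for the quality scores.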
