Trends in Genome Compression

Technological advances in high-throughput sequencing have led to a tremendous increase in the amount of genomic data produced. With the cost down to 2,000 USD for a single human genome, sequencing dozens of individuals is already feasible even for smaller projects or organizations. However, generating the sequence is only one issue; storing, managing, and analyzing it is another. These tasks become more and more challenging due to the sheer size of the data sets and are increasingly considered the most severe bottlenecks in larger genome projects. One possible countermeasure is to compress the data: compression reduces costs both by requiring less disk storage and by requiring less bandwidth when data is shipped to large compute clusters for parallel analysis. Accordingly, sequence compression has recently attracted much interest in the scientific community. In this paper, we explain the basic techniques for sequence compression, point out distinctions between different compression tasks (e.g., genome versus read compression), and present a comparison of current approaches and tools. To further stimulate progress in genome compression research, we also identify key challenges for future systems.
