Adaptive efficient compression of genomes

Abstract: Modern high-throughput sequencing technologies generate DNA sequences at an ever-increasing rate. As the experimental time and cost needed to produce DNA sequences fall, the computational requirements for analyzing and storing them rise steeply. Compression is a key technology for dealing with this challenge. Referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have recently attracted considerable interest in this field. However, current algorithms have high memory requirements and often long run times. In this paper, we propose an adaptive, parallel, and highly efficient referential sequence compression method that allows fine-tuning of the trade-off between required memory and compression speed. When using 12 MB of memory, our method is on par with the best previous algorithms for human genomes in terms of compression ratio (400:1) and compression speed. When provided with 9 GB of main memory, it compresses a complete human genome in just 11 seconds, which is almost three times faster than the best competitor while using less main memory.
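The idea of referential compression described in the abstract, storing only the differences between an input and a known reference, can be illustrated with a minimal Python sketch. This is not the method proposed in the paper: the abstract does not specify the index structure, matching strategy, or parallelization, so the k-mer hash index, the greedy longest-match rule, and all names below (`build_index`, `compress`, `decompress`, the parameter `k`) are assumptions chosen only for illustration.

```python
from typing import Dict, List, Tuple, Union

# One output entry is either a (reference_position, length) match
# or a single literal base that could not be matched.
Entry = Union[Tuple[int, int], str]


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(target: str, reference: str, k: int = 4) -> List[Entry]:
    """Greedily encode target as matches against reference plus literals."""
    index = build_index(reference, k)
    entries: List[Entry] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the next k-mer and extend it.
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            entries.append((best_pos, best_len))  # store a reference pointer
            i += best_len
        else:
            entries.append(target[i])  # store the differing base itself
            i += 1
    return entries


def decompress(entries: List[Entry], reference: str) -> str:
    """Rebuild the target from the entries and the same reference."""
    parts: List[str] = []
    for entry in entries:
        if isinstance(entry, tuple):
            pos, length = entry
            parts.append(reference[pos:pos + length])
        else:
            parts.append(entry)
    return "".join(parts)


reference = "ACGTACGTTTGACCA"
target = "ACGTACGATTGACCA"  # differs from the reference by one base
entries = compress(target, reference)
print(entries)  # [(0, 7), 'A', (8, 7)]
```

In this toy case a 15-base input is reduced to two pointers into the reference and one literal. In an index-based scheme like this sketch, the index over the reference is also where memory is spent, which gives some intuition for why a method might expose a trade-off between memory and speed.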
