Efficient storage of high throughput DNA sequencing data using reference-based compression.

Data storage costs have become an appreciable proportion of total cost in the creation and analysis of DNA sequence data. Of particular concern is that the rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity. In this paper we present a new reference-based compression method that efficiently compresses DNA sequences for storage. Our approach works for resequencing experiments that target well-studied genomes. We align new sequences to a reference genome and then encode the differences between the new sequence and the reference genome for storage. Our compression method is most efficient when we allow controlled loss of data in the saving of quality information and unaligned sequences. With this new compression method we observe exponential efficiency gains as read lengths increase, and the magnitude of this efficiency gain can be controlled by changing the amount of quality information stored. Our compression method is tunable: the storage of quality scores and unaligned sequences may be adjusted for different experiments to conserve information or to minimize storage costs. It provides one opportunity to address the threat that increasing DNA sequence volumes will overcome our ability to store the sequences.
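The core idea of encoding only the differences between an aligned read and the reference can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `encode_read` and `decode_read`, the record layout, and the toy sequences are assumptions made for illustration. The sketch handles substitutions only and omits insertions, deletions, quality scores, unaligned reads, and any entropy coding of the stored values.

```python
from typing import List, Tuple

# Hypothetical record layout: (start position, read length, substitutions),
# where each substitution is (offset within the read, observed base).
Diff = Tuple[int, str]
Record = Tuple[int, int, List[Diff]]


def encode_read(reference: str, read: str, pos: int) -> Record:
    """Store an aligned read as its position, length, and mismatches only."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), diffs


def decode_read(reference: str, record: Record) -> str:
    """Rebuild the read by copying the reference and applying the mismatches."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Toy data (assumed): the read aligns at position 4 with one substitution.
reference = "ACGTACGTTAGCCGATAGGC"
read = "ACGTTAGACGAT"

record = encode_read(reference, read, pos=4)
restored = decode_read(reference, record)
```

In this toy case the 12-base read is reduced to a start position, a length, and a single substitution. A read that matches the reference exactly would need no substitution entries at all.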
