Compression of short-read sequences using path encoding

Storing, transmitting, and archiving the volume of data produced by next-generation sequencing is becoming a significant computational burden. For example, large-scale RNA-seq meta-analyses may now routinely process tens of terabytes of sequence. We present an approach to biological sequence compression that eases the management of data produced by large-scale transcriptome sequencing. Our approach sits between pure reference-based compression and reference-free compression, combining much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs (a common task in genome assembly) and context-dependent arithmetic coding. Supporting this method is a system, called a bit tree, that compactly stores sets of k-mers and is of independent interest. Using these techniques, we are able to encode RNA-seq reads using 3% to 11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than recent competing approaches. We also show that good compression can still be achieved even if the reference is very poorly matched to the reads being encoded.
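To illustrate the connection between de Bruijn graph paths and context-dependent arithmetic coding, the following is a minimal sketch and not the authors' implementation. It assumes that a read is treated as a walk through k-mer nodes, that each node keeps adaptive counts of which base follows it, and that those counts are seeded from a reference. Instead of running a real arithmetic coder, it reports the ideal code length (the sum of -log2 p over the encoded bases), which an arithmetic coder would approach. The function names, the pseudocount, and the toy sequences are all hypothetical.

```python
import math
from collections import defaultdict


def seed_counts(reference, k):
    """For each k-mer in the reference, count which base follows it.

    Assumption for this sketch: these counts play the role of edge
    weights in a de Bruijn graph built from the reference.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(reference) - k):
        counts[reference[i:i + k]][reference[i + k]] += 1
    return counts


def estimated_bits(read, counts, k, alphabet="ACGT", pseudo=1.0):
    """Ideal code length, in bits, of the bases that follow the read's first k-mer.

    Each base is predicted from the k-mer that precedes it (the current
    node on the path). The model is adaptive: after a base is "encoded",
    its count is incremented, so repeated paths get cheaper. The first
    k-mer itself is not costed here; it would be stored separately.
    """
    bits = 0.0
    for i in range(len(read) - k):
        context = read[i:i + k]
        base = read[i + k]
        seen = counts[context]
        # Pseudocounts keep unseen bases encodable (hypothetical smoothing choice).
        total = sum(seen.values()) + pseudo * len(alphabet)
        p = (seen[base] + pseudo) / total
        bits += -math.log2(p)
        seen[base] += 1  # adaptive update after encoding this base
    return bits


if __name__ == "__main__":
    k = 3
    reference = "ACGTACGTACGT"  # toy stand-in for a transcriptome
    on_path = estimated_bits("ACGTACGT", seed_counts(reference, k), k)
    off_path = estimated_bits("ATTTGCAA", seed_counts(reference, k), k)
    print(f"read following reference paths: {on_path:.2f} bits")
    print(f"read sharing no k-mers:         {off_path:.2f} bits")
```

In this toy model, a read that follows paths already present in the reference costs fewer bits than one that shares no k-mers with it. The second read is still encodable, at the uniform 2 bits per base, which is in the spirit of the claim that a poorly matched reference degrades compression without breaking it.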


[21]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[22]  B. Berger,et al.  Compressive genomics , 2012, Nature Biotechnology.

[23]  Jihoon Kim,et al.  HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads , 2013, J. Am. Medical Informatics Assoc..

[24]  George Varghese,et al.  Compressing Genomic Sequence Fragments Using SlimGene , 2010, RECOMB.

[25]  Faraz Hach,et al.  SCALCE: boosting sequence compression algorithms using locally consistent encoding , 2012, Bioinform..

[26]  Idoia Ochoa,et al.  QualComp: a new lossy compressor for quality scores based on rate distortion theory , 2013, BMC Bioinformatics.

[27]  S. Horvath,et al.  Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing , 2013, Nature.

[28]  James Lowey,et al.  Bioinformatics Applications Note Sequence Analysis G-sqz: Compact Encoding of Genomic Sequence and Quality Data , 2022 .

[29]  Ian H. Witten,et al.  Arithmetic coding revisited , 1998, TOIS.

[30]  Allam Apparao,et al.  DNABIT Compress – Genome compression algorithm , 2011, Bioinformation.

[31]  Christian Steinruecken Compressing Sets and Multisets of Sequences , 2015, IEEE Transactions on Information Theory.

[32]  Rob Patro,et al.  Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms , 2013, Nature Biotechnology.

[33]  James K. Bonfield,et al.  Compression of FASTQ and SAM Format Sequencing Data , 2013, PloS one.

[34]  L. Pachter,et al.  Streaming fragment assignment for real-time analysis of sequencing experiments , 2012, Nature Methods.

[35]  Bonnie Berger,et al.  Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification , 2014, RECOMB.

[36]  Richard E. Ladner,et al.  Grammar-based Compression of DNA Sequences , 2007 .

[37]  James T. Robinson,et al.  Compression of Structured High-Throughput Sequencing Data , 2013, PloS one.

[38]  Ruiqiang Li,et al.  Single-cell RNA-Seq profiling of human preimplantation embryos and embryonic stem cells , 2013, Nature Structural &Molecular Biology.

[39]  Sandrine Dudoit,et al.  Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments , 2010, BMC Bioinformatics.

[40]  Giovanna Rosone,et al.  Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform , 2012, Bioinform..

[41]  Yong Zhang,et al.  DNA sequence compression using the Burrows-Wheeler Transform , 2002, Proceedings. IEEE Computer Society Bioinformatics Conference.

[42]  Glen G. Langdon,et al.  Arithmetic Coding , 1979 .

[43]  Markus Hsi-Yang Fritz,et al.  Efficient storage of high throughput DNA sequencing data using reference-based compression. , 2011, Genome research.