Reference-based compression of short-read sequences using path encoding

Motivation: Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed.

Results: We present an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our approach sits between pure reference-based compression and reference-free compression, combining much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system for compactly storing sets of k-mers that is of independent interest. We are able to encode RNA-seq reads using 3–11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that good compression can still be achieved even when the reference is very poorly matched to the reads being encoded.

Availability and implementation: Source code and binaries are freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.

Contact: carlk@cs.cmu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
