Better quality score compression through sequence-based quality smoothing

Next-generation sequencing (NGS) is becoming exponentially cheaper. As a result, genomic data is growing exponentially, but storage capacity is not keeping pace, which makes compression necessary. Most of the entropy of NGS data lies in the quality values associated with each read, and these values are often more diverse than necessary. For this reason, many tools, such as Quartz or GeneCodeq, change (smooth) the quality scores to improve compressibility without altering the important information they carry for downstream analyses such as SNP calling. We use the FM-Index, a type of compressed suffix array, to reduce the storage requirements of a k-mer dictionary, together with an effective smoothing algorithm that maintains high precision in SNP calling pipelines while reducing the entropy of the quality scores. We present YALFF (Yet Another Lossy Fastq Filter), a tool that compresses quality scores by smoothing, leading to improved compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources. YALFF is available at https://github.com/yhhshb/yalff
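The general idea of dictionary-based quality smoothing can be illustrated with a minimal sketch. This is not YALFF's actual algorithm: it assumes a simple rule in which every base covered by a k-mer found in a trusted dictionary receives one fixed high quality character, and it uses a plain Python set as a stand-in for the FM-Index. The names `smooth_qualities`, `entropy`, `trusted`, and the replacement character `"I"` are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def smooth_qualities(read, quals, kmers, k, high="I"):
    """Replace the quality of every base covered by a trusted k-mer.

    `kmers` is a plain set here (an assumption); a succinct index such as
    an FM-Index would answer the same membership query in less memory.
    """
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in kmers:
            for j in range(i, i + k):
                covered[j] = True
    # Keep the original quality wherever no trusted k-mer covers the base.
    return "".join(high if c else q for q, c in zip(quals, covered))


def entropy(symbols):
    """Empirical zero-order entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Toy data (made up for illustration).
trusted = {"ACGT", "CGTA", "GTAC"}
read = "ACGTACTT"
quals = "F;HB<A#5"

smoothed = smooth_qualities(read, quals, trusted, k=4)
print(smoothed)
print(entropy(quals), entropy(smoothed))
```

In this toy case the first six bases are covered by trusted k-mers and collapse to a single quality value, so the smoothed string has lower empirical entropy than the original, while the two uncovered bases keep their original scores.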
