GeneCodeq: quality score compression and improved genotyping using a Bayesian framework

MOTIVATION: The exponential reduction in the cost of genome sequencing has resulted in rapid growth of genomic data. Most of the entropy of short-read data lies not in the sequence of read bases themselves but in their quality scores, the confidence measurement that each base has been sequenced correctly. Lossless compression methods are now close to their theoretical limits, and hence there is a need for lossy methods that further reduce the complexity of these data without impacting downstream analyses.

RESULTS: We propose GeneCodeq, a Bayesian method inspired by coding theory that adjusts quality scores to improve their compressibility without adversely impacting genotyping accuracy. Our model leverages a corpus of k-mers to reduce the entropy of the quality scores and thereby improve the compressibility of these data (in FASTQ or SAM/BAM/CRAM files), resulting in compression ratios that significantly exceed those of other methods. Our approach can also be combined with existing lossy compression schemes to further reduce entropy, and it allows the user to specify a reference panel of expected sequence variations to improve the model accuracy. In addition to extensive empirical evaluation, we derive novel theoretical insights that explain the empirical performance and pitfalls of corpus-based quality score compression schemes in general. Finally, we show that, as a positive side effect of compression, the model can lead to improved genotyping accuracy.

AVAILABILITY AND IMPLEMENTATION: GeneCodeq is available at: github.com/genecodeq/eval

CONTACT: dan@petagene.com

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
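The link the abstract draws between quality-score entropy and compressibility can be illustrated with a minimal Python sketch. It is not GeneCodeq's Bayesian model: it assumes Phred+33 encoded quality strings, and the `trusted` mask is a hypothetical stand-in for whatever evidence (for example, a k-mer corpus match) would justify raising a score. The function names and the `q_max` parameter are also assumptions made for illustration.

```python
import math
from collections import Counter


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quals(qual_string: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string (Phred+33 assumed) into integer scores."""
    return [ord(c) - offset for c in qual_string]


def empirical_entropy(symbols: list[int]) -> float:
    """Zeroth-order empirical entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def smooth_trusted(quals: list[int], trusted: list[bool], q_max: int = 40) -> list[int]:
    """Hypothetical adjustment: set scores at trusted positions to q_max.

    This only mimics the effect of a corpus-based scheme; the real method
    decides adjustments with a Bayesian model that is not reproduced here.
    """
    return [q_max if t else q for q, t in zip(quals, trusted)]


# One read's quality string; 'I' = 40, '5' = 20, '?' = 30, 'D' = 35.
quals = decode_quals("II5I?IDI")
# Assumed mask: every position except the third is supported by the corpus.
trusted = [True, True, False, True, True, True, True, True]
smoothed = smooth_trusted(quals, trusted)

before = empirical_entropy(quals)
after = empirical_entropy(smoothed)
print(f"entropy before: {before:.3f} bits/score, after: {after:.3f} bits/score")
```

Fewer distinct score values with a more skewed distribution give a lower entropy, which is what a downstream entropy coder can exploit. The untrusted position keeps its original low score, so information about a possible sequencing error is preserved.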
