Multi-Sided Compression Performance Assessment of ABI SOLiD WES Data

Data storage has been a major and growing part of IT budgets for research for many years. In biology especially, the volume of raw data is growing continuously, and the advent of so-called "next-generation" sequencers has made things worse. Affordable prices have pushed scientists to sequence whole genomes at scale and to screen large cohorts of patients, producing huge amounts of data as a side effect. The need to fit as much data as possible into the available storage volumes has encouraged the development of new compression algorithms and tools. We focus here on state-of-the-art compression tools and measure their compression performance on ABI SOLiD data.
