Compression for population genetic data through finite-state entropy

We improve the efficiency of population genetic file formats and genome-wide association study (GWAS) computation by leveraging the distribution of sample ordering in population-level genetic data. We identify conditional exchangeability of these data and recommend finite-state entropy algorithms as an arithmetic code naturally suited to population genetic data. We show speed and size improvements of between 10% and 40% over dictionary compression methods for population genetic data, such as Zstd and Zlib, in computation and decompression tasks. We provide a prototype for GWAS with finite-state entropy compression, demonstrating significant space savings and speed comparable to the state of the art.
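The intuition behind preferring an entropy coder for exchangeable data can be illustrated with a small sketch. This is not the authors' prototype and it does not implement finite-state entropy coding. It assumes a single simulated variant with genotypes drawn independently under Hardy-Weinberg proportions (an illustrative assumption), computes the zero-order entropy bound that an ideal entropy coder would approach, and compares it with the output size of zlib from the Python standard library. The function names and parameters are hypothetical.

```python
import math
import random
import zlib
from collections import Counter


def simulate_genotypes(n_samples, allele_freq, seed=0):
    """Draw genotypes (0, 1 or 2 copies of the alternate allele) independently
    for each sample, so the sample order carries no information."""
    rng = random.Random(seed)
    return [
        (rng.random() < allele_freq) + (rng.random() < allele_freq)
        for _ in range(n_samples)
    ]


def entropy_bound_bytes(genotypes):
    """Zero-order empirical entropy of the genotype vector, in bytes.
    An ideal entropy coder approaches this size when symbols are exchangeable."""
    n = len(genotypes)
    counts = Counter(genotypes)
    bits = -sum(c * math.log2(c / n) for c in counts.values())
    return bits / 8


def zlib_bytes(genotypes, level=9):
    """Size of the same vector after dictionary (DEFLATE) compression,
    storing one genotype per byte."""
    return len(zlib.compress(bytes(genotypes), level))


if __name__ == "__main__":
    g = simulate_genotypes(n_samples=20000, allele_freq=0.2, seed=1)
    print("raw bytes:          ", len(g))
    print("entropy bound bytes:", round(entropy_bound_bytes(g)))
    print("zlib bytes:         ", zlib_bytes(g))
```

On data like this, a dictionary method has no repeated long patterns to exploit, so its output is expected to stay above the entropy bound. Real genotype data has additional structure across variants that this sketch ignores.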
