Paper information - Compression for population genetic data through finite-state entropy


We improve the efficiency of population genetic file formats and GWAS computation by leveraging the distribution of sample ordering in population-level genetic data. We identify conditional exchangeability of these data, recommending finite-state entropy algorithms as an arithmetic code naturally suited to population genetic data. We show between 10% and 40% speed and size improvements over dictionary compression methods for population genetic data, such as Zstd and Zlib, in computation and decompression tasks. We provide a prototype for genome-wide association study with finite-state entropy compression, demonstrating significant space saving and speed comparable to the state-of-the-art.
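The exchangeability argument can be illustrated with a small sketch. It is not the authors' implementation and it does not implement finite-state entropy coding. It only computes the order-0 Shannon entropy of a simulated genotype vector, which is the size an entropy coder such as FSE aims to approach, and compares it with Zlib's output. The Hardy-Weinberg simulation, the allele frequency of 0.1, the sample count, and the function names are all assumptions made for illustration.

```python
import math
import random
import zlib
from collections import Counter


def order0_entropy_bits(symbols):
    """Empirical order-0 Shannon entropy of a sequence, in bits per symbol."""
    n = len(symbols)
    counts = Counter(symbols)
    # Iterate in sorted key order so the result cannot depend on sample ordering.
    return -sum((counts[s] / n) * math.log2(counts[s] / n) for s in sorted(counts))


def simulate_genotypes(n_samples, allele_freq, rng):
    """Draw 0/1/2 alternate-allele counts under Hardy-Weinberg proportions (assumed)."""
    return [(rng.random() < allele_freq) + (rng.random() < allele_freq)
            for _ in range(n_samples)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
genotypes = simulate_genotypes(10_000, 0.1, rng)

# Permuting the samples leaves the symbol counts, and so the entropy, unchanged.
shuffled = genotypes[:]
rng.shuffle(shuffled)

# Size an ideal order-0 entropy coder would need, ignoring table overhead.
bound_bytes = order0_entropy_bits(genotypes) * len(genotypes) / 8

# Dictionary compression of the same data, stored one byte per genotype.
zlib_bytes = len(zlib.compress(bytes(genotypes), 9))

print(f"order-0 entropy bound: {bound_bytes:.0f} bytes")
print(f"zlib level 9:          {zlib_bytes} bytes")
```

Because the simulated samples are independent, a dictionary coder has no repeated structure to exploit beyond the symbol frequencies, while the entropy bound depends only on those frequencies and not on the order of the samples.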

Lloyd T. Elliott | Winfield Chen
