Paper information - Context binning, model clustering and adaptivity for data compression of genetic data

Context binning, model clustering and adaptivity for data compression of genetic data

The rapid growth of genetic databases means that improvements in their compression yield large savings, which requires better yet inexpensive statistical models. This article proposes automated optimizations, e.g. of Markov-like models, in particular context binning and model clustering. While it is common to simply discard the low bits of the context, the proposed context binning optimizes this reduction as a lookup table: state = bin[context] determines the probability distribution, extracting nearly all useful information, even from very large contexts, into a small number of states. Model clustering applies k-means clustering in the space of general statistical models, making it possible to optimize a few models (as cluster centroids) from which one is chosen, e.g. separately for each read. Some adaptivity techniques for handling data non-stationarity are also briefly discussed. This article is a work in progress, to be expanded in the future.
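To make the table lookup state = bin[context] concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the article's actual optimization procedure: the four-letter alphabet, the additive smoothing constant, the round-robin initialization and the k-means-style reassignment loop are all choices made for this example, and the function names (`context_counts`, `fit_bins`, `total_bits`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def context_counts(seq, order):
    """Count next-symbol occurrences for every context of length `order`."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    return counts


def smoothed(counter, alpha=0.5):
    """Additively smoothed distribution over ALPHABET (alpha is an assumption)."""
    total = sum(counter[s] for s in ALPHABET) + alpha * len(ALPHABET)
    return {s: (counter[s] + alpha) / total for s in ALPHABET}


def cost_bits(counter, probs):
    """Bits needed to encode the symbols in `counter` using distribution `probs`."""
    return -sum(n * math.log2(probs[s]) for s, n in counter.items())


def state_probs(counts, bins, n_states):
    """Pool the counts of all contexts sharing a state; one distribution per state."""
    pooled = [Counter() for _ in range(n_states)]
    for ctx, state in bins.items():
        pooled[state].update(counts[ctx])
    return [smoothed(p) for p in pooled]


def fit_bins(counts, n_states, max_iters=50):
    """k-means-style loop: move each context to the state that encodes it cheapest."""
    contexts = sorted(counts)
    bins = {c: i % n_states for i, c in enumerate(contexts)}  # arbitrary start
    for _ in range(max_iters):
        probs = state_probs(counts, bins, n_states)
        new_bins = {
            c: min(range(n_states), key=lambda k: cost_bits(counts[c], probs[k]))
            for c in contexts
        }
        if new_bins == bins:
            break
        bins = new_bins
    return bins, state_probs(counts, bins, n_states)


def total_bits(seq, order, bins, probs):
    """Ideal code length of `seq` when each symbol is coded with probs[bin[context]]."""
    return sum(
        -math.log2(probs[bins[seq[i - order:i]]][seq[i]])
        for i in range(order, len(seq))
    )


# Usage: compare a single shared distribution with four binned states.
demo_seq = "ACGT" * 40 + "AAAC" * 40
demo_counts = context_counts(demo_seq, order=2)
for n in (1, 4):
    demo_bins, demo_probs = fit_bins(demo_counts, n_states=n)
    print(n, "state(s):", round(total_bits(demo_seq, 2, demo_bins, demo_probs), 1), "bits")
```

The table `bins` plays the role of bin[context]: the decoder only needs this table and one distribution per state, so the model stays small even when the number of distinct contexts is large. The cost here is measured on the same data used for fitting; a real compressor would also have to account for storing or adapting the table and distributions.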

Jarek Duda
