Exploration of Pattern-Matching Techniques for Lossy Compression on Cosmology Simulation Data Sets

Because of the vast volume of data produced by today's scientific simulations, lossy compression that allows user-controlled information loss can significantly reduce the data size and the I/O burden. However, for large-scale cosmology simulations such as the Hardware/Hybrid Accelerated Cosmology Code (HACC), where memory overhead constraints restrict compression to one snapshot at a time, the achievable lossy compression ratio is extremely limited because of the fairly low spatial coherence and high irregularity of the data. In this work, we propose a pattern-matching (similarity searching) technique to optimize the prediction accuracy and compression ratio of the SZ lossy compressor on the HACC data sets. We evaluate our proposed method with different configurations and compare it with state-of-the-art lossy compressors. Experiments show that our proposed optimization approach can improve the prediction accuracy and reduce the compressed size of the quantization codes compared with SZ. We also present several lessons useful for future research involving pattern-matching techniques for lossy compression.
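
To make the idea concrete, the following is a minimal Python sketch of how a similarity search could act as the predictor in an error-bounded, prediction-plus-quantization scheme. It is an illustration under stated assumptions, not the method evaluated in this work and not the SZ implementation: the function names, the window and search-range parameters, and the squared-distance matching criterion are all hypothetical. The predictor looks only at already reconstructed values, so a decompressor could repeat the same search.

```python
import math


def quantize_with_predictor(data, error_bound, predict):
    """Error-bounded quantization of a 1D sequence (illustrative only).

    Each value is predicted from previously reconstructed values, and the
    prediction error is rounded to an integer multiple of 2 * error_bound.
    """
    recon = []   # values the decompressor would see
    codes = []   # integer quantization codes to be entropy coded
    step = 2.0 * error_bound
    for x in data:
        pred = predict(recon)
        code = int(round((x - pred) / step))
        codes.append(code)
        recon.append(pred + code * step)
    return codes, recon


def previous_value_predictor(recon):
    """Baseline: predict the next value as the last reconstructed value."""
    return recon[-1] if recon else 0.0


def make_pattern_predictor(window=4, search=64):
    """Hypothetical pattern-matching predictor.

    Finds the earlier window (within `search` positions) that is closest,
    in squared distance, to the most recent `window` reconstructed values,
    and predicts the value that followed that earlier window.
    """
    def predict(recon):
        n = len(recon)
        if n < window + 1:
            # Not enough history to match against; fall back to baseline.
            return previous_value_predictor(recon)
        target = recon[n - window:]
        best_pos, best_dist = None, math.inf
        start = max(0, n - window - search)
        for pos in range(start, n - window):
            dist = sum((recon[pos + k] - target[k]) ** 2
                       for k in range(window))
            if dist < best_dist:
                best_pos, best_dist = pos, dist
        return recon[best_pos + window]
    return predict
```

Passing `make_pattern_predictor()` and `previous_value_predictor` to `quantize_with_predictor` on the same input allows the magnitudes of the two code streams to be compared; smaller codes generally compress better under an entropy coder. On irregular data the similarity search may not help, and the brute-force search shown here adds noticeable cost.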
