A Feature-Driven Fixed-Ratio Lossy Compression Framework for Real-World Scientific Datasets

Today’s scientific applications and advanced instruments produce extremely large volumes of data every day, making error-controlled lossy compression a critical technique for scientific data storage and management. Existing lossy compressors for scientific data, however, are designed mainly around an error-control-driven mechanism, which cannot be applied efficiently to the fixed-ratio use case, where a desired compression ratio must be reached because data processing and management resources are restricted, such as limited memory/storage capacity and network bandwidth. To address this gap, we propose a low-cost, compressor-agnostic, feature-driven fixed-ratio lossy compression framework (FXRZ). The key contributions are three-fold. (1) We perform an in-depth analysis of the correlation between diverse data features and compression ratios across a wide range of application datasets, which forms the foundation of our framework. (2) We propose a series of optimization strategies that enable the framework to identify the expected error configuration with fairly high accuracy and very low computational cost. (3) We comprehensively evaluate our framework using 4 state-of-the-art error-controlled lossy compressors on 10 real-world scientific datasets, based on different snapshots and simulation configurations, from 4 applications across different domains. Our experiments show that FXRZ outperforms the state-of-the-art related work by 108×. Experiments with 4,096 cores on a supercomputer show a performance gain of 1.18× to 8.71× over the related work in overall parallel data dumping.
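To make the fixed-ratio problem concrete, the following is a minimal Python sketch of choosing an error bound for a target compression ratio from a small model rather than by repeated trial compressions. It is not FXRZ's method: the abstract describes a model driven by data features, whereas this sketch substitutes a log-log interpolation over a few sampled error bounds. The compressor `toy_compress` (quantize, delta-encode, deflate) and the helpers `compression_ratio` and `estimate_error_bound` are hypothetical names introduced only for illustration, and the input is synthetic.

```python
import math
import random
import struct
import zlib


def toy_compress(values, error_bound):
    """Hypothetical error-bounded compressor: quantize, delta-encode, deflate."""
    step = 2.0 * error_bound
    # Rounding to the nearest multiple of `step` keeps |v - code*step| <= error_bound.
    codes = [round(v / step) for v in values]
    deltas = [codes[0]] + [b - a for a, b in zip(codes, codes[1:])]
    return zlib.compress(struct.pack(f"<{len(deltas)}i", *deltas), 6)


def compression_ratio(values, error_bound):
    """Size of the data as float64 divided by the compressed size."""
    return 8 * len(values) / len(toy_compress(values, error_bound))


def estimate_error_bound(sample_bounds, sample_ratios, target_ratio):
    """Invert the sampled ratio curve by piecewise-linear interpolation in log-log space."""
    points = list(zip(sample_bounds, sample_ratios))
    for (e0, r0), (e1, r1) in zip(points, points[1:]):
        if min(r0, r1) <= target_ratio <= max(r0, r1):
            if r0 == r1:
                return e0
            t = (math.log(target_ratio) - math.log(r0)) / (math.log(r1) - math.log(r0))
            return math.exp(math.log(e0) + t * (math.log(e1) - math.log(e0)))
    raise ValueError("target ratio lies outside the sampled range")


# Synthetic smooth field with mild noise (an assumption, not a real dataset).
rng = random.Random(0)
data = [math.sin(i / 200.0) + rng.gauss(0.0, 0.01) for i in range(20000)]

# Sample the ratio at a few error bounds, then pick a target inside that range.
bounds = [1e-4, 1e-3, 1e-2, 1e-1]
ratios = [compression_ratio(data, b) for b in bounds]
target = math.sqrt(ratios[0] * ratios[-1])

eb = estimate_error_bound(bounds, ratios, target)
achieved = compression_ratio(data, eb)
print(f"target ratio {target:.2f}, estimated error bound {eb:.3e}, achieved ratio {achieved:.2f}")
```

The estimate costs a handful of sample compressions plus one interpolation. How close the achieved ratio lands to the target depends on how well the sampled curve follows the true one, which is the accuracy question the abstract raises for fixed-ratio frameworks.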
