SDC Resilient Error-bounded Lossy Compressor

Lossy compression is one of the most important strategies for coping with the large volumes of data produced by scientific applications, yet little work has been done to make it resilient against silent data corruption (SDC). SDC is becoming non-negligible for two reasons: exascale computing demands complex scientific simulations that produce vast volumes of data, and some instruments and devices (such as interplanetary space probes) must transfer large amounts of data in error-prone environments. In this paper, we propose an SDC-resilient error-bounded lossy compressor built on the SZ compression framework. Specifically, we adopt a new independent-block-wise model that decomposes the entire dataset into many independent sub-blocks, each of which is compressed separately. We then design and implement a series of error detection and correction strategies based on SZ. We are the first to extend algorithm-based fault tolerance (ABFT) to lossy compression. Our solution incurs negligible execution overhead in the absence of soft errors. When soft errors do occur, it keeps the decompressed data within the user-specified error bound, with only a very limited degradation in compression ratio.
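The block-wise idea can be illustrated with a minimal sketch. It is not the SZ implementation or the detection scheme of this paper: it assumes plain uniform quantization in place of SZ's prediction-based pipeline, a single integer sum as the ABFT-style checksum of each block, and hypothetical function names (`quantize`, `compress_blocks`, `find_corrupted_blocks`). It shows only why independent blocks help: a corrupted value is confined to one block, and that block can be identified by its checksum.

```python
def quantize(values, error_bound):
    """Uniform quantization; reconstruction error is at most error_bound
    (up to floating-point rounding). Assumed stand-in for SZ's quantizer."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes, error_bound):
    """Map integer codes back to approximate floating-point values."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def compress_blocks(data, block_size, error_bound):
    """Split data into independent blocks; store each block's integer codes
    together with an integer checksum (sum of codes) computed at write time."""
    blocks = []
    for start in range(0, len(data), block_size):
        codes = quantize(data[start:start + block_size], error_bound)
        blocks.append((codes, sum(codes)))
    return blocks


def find_corrupted_blocks(blocks):
    """Recompute every checksum and return indices of blocks that mismatch."""
    return [i for i, (codes, checksum) in enumerate(blocks)
            if sum(codes) != checksum]


data = [0.1 * i for i in range(16)]
eb = 0.01
blocks = compress_blocks(data, block_size=4, error_bound=eb)
assert find_corrupted_blocks(blocks) == []   # no soft error: nothing flagged

blocks[2][0][1] ^= 1 << 3                    # simulate a bit flip in block 2
corrupted = find_corrupted_blocks(blocks)    # only block 2 is flagged
```

A single sum cannot detect errors that cancel each other out, and it cannot correct anything by itself; a practical scheme would need stronger checksums and a recovery path, such as recompressing the flagged block.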
