Paper information - waveSZ: a hardware-algorithm co-design of efficient lossy compression for scientific data

Error-bounded lossy compression is critical to the success of extreme-scale scientific research because of the ever-increasing volumes of data produced by today's high-performance computing (HPC) applications. Error-controlled lossy compressors not only significantly reduce the I/O and storage burden but also retain high data fidelity for post-analysis. Existing state-of-the-art lossy compressors, however, generally suffer from relatively low compression and decompression throughput (up to hundreds of megabytes per second on a single CPU core), which considerably restricts the adoption of lossy compression by many HPC applications, especially those with a fairly high data production rate. In this paper, we propose a highly efficient lossy compression approach based on field-programmable gate arrays (FPGAs) under the state-of-the-art lossy compression model SZ. Our contributions are fourfold. (1) We adopt a wavefront memory layout to alleviate the data dependency during prediction for higher-dimensional predictors, such as the Lorenzo predictor. (2) We propose a co-design framework named waveSZ, based on the wavefront memory layout and the characteristics of the SZ algorithm, and carefully implement it by using high-level synthesis. (3) We propose a hardware-algorithm co-optimization method to improve the performance. (4) We evaluate our proposed waveSZ on three real-world HPC simulation datasets from the Scientific Data Reduction Benchmarks and compare it with other state-of-the-art methods on both CPUs and FPGAs. Experiments show that waveSZ improves SZ's compression throughput by 6.9X to 8.7X over the production version running on a state-of-the-art CPU, and improves the compression ratio and throughput by 2.1X and 5.8X on average, respectively, compared with the state-of-the-art FPGA design.
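The wavefront idea in contribution (1) can be illustrated with a small software sketch. It is not the authors' FPGA design: it assumes a first-order 2D Lorenzo predictor (up + left - diagonal) and a simple linear quantizer whose bin width is twice the error bound, and all function names are hypothetical. Real SZ additionally caps the number of quantization bins, handles unpredictable values, and entropy-codes the result, all of which is omitted here. The point is only that every cell on one anti-diagonal (i + j = k) depends on earlier anti-diagonals alone, so cells within a front carry no dependency on each other.

```python
import math


def wavefront_order(rows, cols):
    """Yield the cells of a rows x cols grid grouped by anti-diagonal i + j = k."""
    for k in range(rows + cols - 1):
        i_lo = max(0, k - cols + 1)
        i_hi = min(rows - 1, k)
        yield [(i, k - i) for i in range(i_lo, i_hi + 1)]


def lorenzo_predict(recon, i, j):
    """First-order 2D Lorenzo prediction from already reconstructed neighbours.

    Neighbours outside the grid are treated as 0.0 (an assumption of this sketch).
    """
    up = recon[i - 1][j] if i > 0 else 0.0
    left = recon[i][j - 1] if j > 0 else 0.0
    diag = recon[i - 1][j - 1] if i > 0 and j > 0 else 0.0
    return up + left - diag


def compress(data, error_bound):
    """Return integer quantization codes and the reconstruction the decoder will see."""
    rows, cols = len(data), len(data[0])
    step = 2.0 * error_bound  # bin width, so the rounding error is at most error_bound
    recon = [[0.0] * cols for _ in range(rows)]
    codes = [[0] * cols for _ in range(rows)]
    for front in wavefront_order(rows, cols):
        # Cells in one front read only cells from the two previous fronts,
        # so they could be processed in parallel or in a pipeline.
        for i, j in front:
            pred = lorenzo_predict(recon, i, j)
            code = round((data[i][j] - pred) / step)
            codes[i][j] = code
            recon[i][j] = pred + step * code
    return codes, recon


def decompress(codes, error_bound):
    """Rebuild the field from the codes alone, visiting cells in the same order."""
    rows, cols = len(codes), len(codes[0])
    step = 2.0 * error_bound
    recon = [[0.0] * cols for _ in range(rows)]
    for front in wavefront_order(rows, cols):
        for i, j in front:
            recon[i][j] = lorenzo_predict(recon, i, j) + step * codes[i][j]
    return recon


if __name__ == "__main__":
    # A smooth synthetic field stands in for simulation data (illustrative only).
    field = [[math.sin(0.1 * i) + math.cos(0.1 * j) for j in range(16)]
             for i in range(12)]
    eb = 1e-3
    codes, _ = compress(field, eb)
    restored = decompress(codes, eb)
    max_err = max(abs(a - b) for ra, rb in zip(field, restored)
                  for a, b in zip(ra, rb))
    print("max abs error:", max_err, "bound:", eb)
```

Predicting from reconstructed values rather than original ones is what keeps the decoder in step with the encoder, and it is also what creates the loop-carried dependency that a row-major traversal would expose between consecutive cells.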
