waveSZ: a hardware-algorithm co-design of efficient lossy compression for scientific data

Error-bounded lossy compression is critical to the success of extreme-scale scientific research because of the ever-increasing volumes of data produced by today's high-performance computing (HPC) applications. Error-controlled lossy compressors not only significantly reduce the I/O and storage burden but also retain high data fidelity for post-analysis. Existing state-of-the-art lossy compressors, however, generally suffer from relatively low compression and decompression throughput (up to hundreds of megabytes per second on a single CPU core), which considerably restricts the adoption of lossy compression by many HPC applications, especially those with a fairly high data production rate. In this paper, we propose a highly efficient lossy compression approach based on field-programmable gate arrays (FPGAs) under the state-of-the-art lossy compression model SZ. Our contributions are fourfold. (1) We adopt a wavefront memory layout to alleviate the data dependency during prediction for higher-dimensional predictors, such as the Lorenzo predictor. (2) We propose a co-design framework named waveSZ, based on the wavefront memory layout and the characteristics of the SZ algorithm, and carefully implement it by using high-level synthesis. (3) We propose a hardware-algorithm co-optimization method to improve the performance. (4) We evaluate waveSZ on three real-world HPC simulation datasets from the Scientific Data Reduction Benchmarks and compare it with other state-of-the-art methods on both CPUs and FPGAs. Experiments show that waveSZ improves SZ's compression throughput by 6.9X to 8.7X over the production version running on a state-of-the-art CPU, and improves the compression ratio and throughput by 2.1X and 5.8X on average, respectively, compared with the state-of-the-art FPGA design.
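To make contribution (1) concrete, the following is a minimal software sketch of the wavefront idea, not the waveSZ hardware design itself. It assumes a 2D Lorenzo predictor of the form `up + left - diag` evaluated on reconstructed values, and a simple linear quantizer with bin width `2 * eb`. The function names (`wavefront_order`, `lorenzo_predict`, `compress`) are hypothetical, and details of the real SZ pipeline such as a bounded quantization-code range, handling of unpredictable points, and Huffman coding are deliberately omitted. The point it illustrates is that every cell on one anti-diagonal `i + j = k` depends only on cells from diagonals `k - 1` and `k - 2`, so the cells within a diagonal can be processed independently of each other.

```python
def wavefront_order(rows, cols):
    """Yield the grid cells grouped by anti-diagonal (constant i + j)."""
    for k in range(rows + cols - 1):
        lo = max(0, k - cols + 1)
        hi = min(rows, k + 1)
        yield [(i, k - i) for i in range(lo, hi)]


def lorenzo_predict(recon, i, j):
    """2D Lorenzo prediction from already reconstructed neighbors.

    Neighbors outside the grid are treated as 0.0 (an assumption of this sketch).
    """
    up = recon[i - 1][j] if i > 0 else 0.0
    left = recon[i][j - 1] if j > 0 else 0.0
    diag = recon[i - 1][j - 1] if i > 0 and j > 0 else 0.0
    return up + left - diag


def compress(data, eb):
    """Quantize prediction errors so that |recon - data| <= eb at each cell.

    Returns the integer quantization codes and the reconstructed grid.
    """
    rows, cols = len(data), len(data[0])
    recon = [[0.0] * cols for _ in range(rows)]
    codes = [[0] * cols for _ in range(rows)]
    for front in wavefront_order(rows, cols):
        # Cells in the same front never read each other's results,
        # so this inner loop could be pipelined or run in parallel.
        for i, j in front:
            pred = lorenzo_predict(recon, i, j)
            code = round((data[i][j] - pred) / (2 * eb))
            codes[i][j] = code
            recon[i][j] = pred + 2 * eb * code
    return codes, recon


if __name__ == "__main__":
    grid = [[0.10 * i + 0.25 * j + 0.01 * i * j for j in range(5)] for i in range(4)]
    codes, recon = compress(grid, eb=0.01)
    worst = max(abs(recon[i][j] - grid[i][j]) for i in range(4) for j in range(5))
    print("max abs error:", worst)
```

In a row-major traversal, by contrast, each cell needs the reconstructed value of its immediate left neighbor, which serializes the loop; reordering the data along anti-diagonals removes that dependency between consecutive iterations.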
