Data Compression for the Exascale Computing Era - Survey

While periodic checkpointing has been an important mechanism for tolerating faults in high-performance computing (HPC) systems, it becomes cost-prohibitive as HPC systems approach exascale. Applying compression techniques is one common way to mitigate this burden by reducing the data size, but such techniques are often found to be less effective for scientific datasets. Traditional lossless compression techniques that look for repeated patterns are ineffective for scientific data, which is stored at high precision and therefore rarely contains common patterns. In this paper, we present a comparison of several lossless and lossy data compression algorithms and discuss their methodology in the exascale environment. As data volume increases, we observe a growing number of new domain-driven algorithms that exploit the inherent characteristics exhibited in many scientific datasets, such as relatively small changes in data values from one simulation iteration to the next or among neighboring data. In particular, significant data reduction has been observed with lossy compression. This paper also discusses how the errors introduced by lossy compression are controlled and how they trade off against the compression ratio.
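As a rough illustration of the idea of exploiting small iteration-to-iteration changes under a controlled error, the following is a minimal sketch and not any of the surveyed algorithms. It assumes a user-chosen absolute error bound, quantizes the difference between two consecutive iterations into integer codes, and hands those codes to zlib as a stand-in lossless coder. The function names (`lossy_delta_compress`, `lossy_delta_decompress`, `pack_doubles`) and the synthetic data are hypothetical and chosen only for this example.

```python
import random
import struct
import zlib


def pack_doubles(values):
    """Serialize a list of floats as little-endian 64-bit doubles."""
    return struct.pack(f"<{len(values)}d", *values)


def lossy_delta_compress(prev, curr, error_bound):
    """Quantize curr - prev with bin width 2 * error_bound, then zlib the codes.

    Assumes the quantized codes fit in a signed 32-bit integer.
    """
    step = 2.0 * error_bound
    codes = [round((c - p) / step) for p, c in zip(prev, curr)]
    return zlib.compress(struct.pack(f"<{len(codes)}i", *codes), 9)


def lossy_delta_decompress(prev, blob, error_bound):
    """Rebuild an approximation of curr from prev and the quantized deltas."""
    step = 2.0 * error_bound
    codes = struct.unpack(f"<{len(prev)}i", zlib.decompress(blob))
    return [p + k * step for p, k in zip(prev, codes)]


# Synthetic stand-in for two consecutive simulation iterations:
# high-precision values that change only slightly between steps.
rng = random.Random(0)
prev = [rng.random() for _ in range(10_000)]
curr = [p + rng.uniform(-1e-3, 1e-3) for p in prev]

error_bound = 1e-4
lossless_blob = zlib.compress(pack_doubles(curr), 9)
lossy_blob = lossy_delta_compress(prev, curr, error_bound)
restored = lossy_delta_decompress(prev, lossy_blob, error_bound)
max_error = max(abs(a - b) for a, b in zip(curr, restored))

raw_size = len(pack_doubles(curr))
print("raw bytes:          ", raw_size)
print("lossless zlib bytes:", len(lossless_blob))
print("lossy delta bytes:  ", len(lossy_blob))
print("max absolute error: ", max_error)
```

Because each delta is rounded to the nearest multiple of `2 * error_bound`, the reconstruction error per value stays within `error_bound`, up to floating-point rounding. Loosening the bound leaves fewer distinct codes for the lossless stage and so tends to raise the compression ratio, which is the tradeoff the abstract refers to. The sketch also assumes the decompressor already holds the previous iteration.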
