Exploration of Lossy Compression for Application-Level Checkpoint/Restart

The scale of high performance computing (HPC) systems is growing exponentially, which may shrink the mean time between failures (MTBF) to prohibitive levels, while the I/O performance of parallel file systems is expected to lag far behind the increase in scale. There have therefore been various attempts to decrease checkpoint overhead, one of which is to apply compression techniques to checkpoint files. Most existing techniques focus on lossless compression, but their compression rates, and thus their effectiveness, remain rather limited. Instead, we propose a lossy compression technique for checkpoints based on wavelet transformation, and explore its impact on application results. Experimental application of our lossy compression technique to a production climate application, NICAM, shows that the overall checkpoint time, including compression, is reduced by 81%, while the relative error, averaged over all variables of the compressed physical quantities, remains fairly constant at approximately 1.2% compared to the original checkpoint without compression.
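As a rough illustration of the general idea only, the following minimal sketch applies a one-level Haar wavelet transform to a 1-D array, zeroes small detail coefficients, and measures the resulting error after reconstruction. It is not the method used for NICAM: the choice of the Haar wavelet, the hard threshold, the error metric (mean absolute error normalized by the largest original magnitude), and all function names are assumptions made for this example.

```python
import math

def haar_forward(values):
    """One level of the orthonormal Haar transform (len(values) must be even)."""
    s = math.sqrt(2.0)
    approx = [(values[i] + values[i + 1]) / s for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / s for i in range(0, len(values), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

def lossy_roundtrip(values, threshold):
    """Zero detail coefficients below the threshold (the lossy step), then rebuild.

    Returns the restored values and the number of zeroed coefficients; long runs
    of zeros are what a subsequent lossless encoder could store cheaply.
    """
    approx, detail = haar_forward(values)
    kept = [d if abs(d) >= threshold else 0.0 for d in detail]
    zeroed = sum(1 for d in kept if d == 0.0)
    return haar_inverse(approx, kept), zeroed

def mean_relative_error(original, restored):
    """Mean absolute error normalized by the largest original magnitude (assumed metric)."""
    scale = max(abs(o) for o in original)
    return sum(abs(o - r) for o, r in zip(original, restored)) / (len(original) * scale)

# Hypothetical smooth field standing in for one checkpointed physical quantity.
field = [300.0 + 5.0 * math.sin(2.0 * math.pi * i / 64) for i in range(64)]
restored, zeroed = lossy_roundtrip(field, threshold=0.5)
error = mean_relative_error(field, restored)
print(f"zeroed {zeroed} of {len(field) // 2} detail coefficients, relative error {error:.2e}")
```

Raising the threshold zeroes more coefficients and increases the error, which is the trade-off between checkpoint size and fidelity of the restarted application state.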
