Improving performance of iterative methods by lossy checkpointing

Iterative methods are commonly used to solve large, sparse linear systems, a fundamental operation in many modern scientific simulations. When large-scale iterative methods run in parallel across a large number of ranks, they must periodically checkpoint their dynamic variables to guard against unavoidable fail-stop errors, which requires fast I/O systems and large storage space. Significantly reducing the checkpointing overhead is therefore critical to improving the overall performance of iterative methods. Our contribution is fourfold. (1) We propose a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging lossy compressors. (2) We formulate a lossy checkpointing performance model and theoretically derive an upper bound on the number of extra iterations caused by the distortion of data in lossy checkpoints, in order to guarantee a performance improvement under the lossy checkpointing scheme. (3) We analyze the impact of lossy checkpointing (i.e., the number of extra iterations caused by lossy checkpoint files) for multiple types of iterative methods. (4) We evaluate the lossy checkpointing scheme with optimal checkpointing intervals in a high-performance computing environment with 2,048 cores, using the well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that, in the presence of system failures, our optimized lossy checkpointing scheme reduces the fault tolerance overhead of iterative methods by 23% to 70% compared with traditional checkpointing and by 20% to 58% compared with lossless-compressed checkpointing.
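The trade-off behind contributions (2) and (4) can be illustrated with a small first-order sketch. It is not the performance model derived in the paper: it assumes Young's classic approximation for the checkpoint interval, sqrt(2 * C * MTBF), and a simple expected-overhead estimate in which each failure costs half an interval of lost work, one restart, and (for lossy checkpoints) some extra solver iterations. All function names, parameters, and numbers below are hypothetical and chosen only for illustration.

```python
import math


def young_interval(ckpt_cost, mtbf):
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)


def expected_overhead(ckpt_cost, restart_cost, mtbf, extra_work=0.0):
    """Approximate fault tolerance overhead as a fraction of useful time.

    Assumed first-order model (not the paper's): checkpoints cost C every
    T seconds, and each failure (one per MTBF on average) costs T/2 of lost
    work, one restart, and `extra_work` seconds of additional iterations.
    """
    interval = young_interval(ckpt_cost, mtbf)
    checkpoint_share = ckpt_cost / interval
    failure_share = (interval / 2.0 + restart_cost + extra_work) / mtbf
    return checkpoint_share + failure_share


def break_even_extra_iterations(trad_cost, lossy_cost, mtbf, iter_time):
    """Extra iterations per failure at which lossy and traditional
    checkpointing have equal overhead under the model above.

    Restart cost is assumed equal to checkpoint cost for both schemes.
    """
    trad = expected_overhead(trad_cost, trad_cost, mtbf)
    lossy_base = expected_overhead(lossy_cost, lossy_cost, mtbf)
    return max(0.0, (trad - lossy_base) * mtbf / iter_time)


if __name__ == "__main__":
    # Hypothetical numbers: 50 s uncompressed checkpoint, 5 s lossy
    # checkpoint, 10,000 s MTBF, 1 s per solver iteration.
    print("interval (traditional):", young_interval(50.0, 10_000.0))
    print("interval (lossy):      ", young_interval(5.0, 10_000.0))
    print("break-even extra iterations:",
          break_even_extra_iterations(50.0, 5.0, 10_000.0, 1.0))
```

Under these assumptions, lossy checkpointing pays off as long as the extra iterations incurred after a restart stay below the break-even value, which is the role the paper's upper bound plays in its own, more detailed model.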
