Propitious Checkpoint Intervals to Improve System Performance

The large scale of current and next-generation massively parallel processing (MPP) systems presents significant challenges for fault tolerance. For applications that perform periodic checkpointing, the choice of the checkpoint interval (the period between checkpoints) can have a significant impact on the execution time of the application and on the number of checkpoint I/O operations it performs. Together, these two metrics determine the frequency of checkpoint I/O operations and thereby the contribution of checkpointing to the application's I/O bandwidth demand. In a computing environment where concurrent applications compete for access to network and storage resources, the I/O demand of each application is a crucial factor in determining the throughput of the system. Thus, to achieve good overall system throughput, it is important for the application programmer to choose a checkpoint interval that balances two opposing metrics: the number of checkpoint I/O operations and the application execution time. Finding the optimal checkpoint interval that minimizes the wall clock execution time has been a subject of research over the last decade. In this paper, we present a simple, elegant, and accurate analytical model of a complementary performance metric: the aggregate number of checkpoint I/O operations. Using this model, we derive the optimal checkpoint interval that minimizes the total number of checkpoint I/O operations, and we present extensive simulation studies that validate the model. Insights provided by this model, combined with existing models for wall clock execution time, help application programmers make a well-informed choice of checkpoint interval that leads to an appropriate trade-off between execution time and number of checkpoint I/O operations.
We illustrate the existence of such propitious checkpoint intervals using parameters of four MPP systems: SNL’s Red Storm, ORNL’s Jaguar, LLNL’s Blue Gene/L (BG/L), and a theoretical Petaflop system.

The University of Texas at El Paso Center for Exceptional Computing
