Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems

Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none addresses the selection of runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss the model's relevance to today's parallel computing environments and software, and present case studies of using the model to select runtime parameters.
