An analytical model is used to investigate the effects of checkpointing on the performance and availability of sequential and parallel applications. Known as Steady-State Performability (SSP), the model provides a probabilistic method for quantifying delivered performance in the presence of failure and recovery. Input parameters describe both the distributed application and the processing environment. Terms quantify computational effort as well as the overheads of communication, synchronization, and fault tolerance. By unifying performance and availability, the model allows fault tolerance overheads to be justified. It emphasizes simplicity over detail in the hope of guiding the application developer from design through implementation.

Key Terms: reliability and performance modeling, parallel and distributed systems.

1.0 INTRODUCTION

Typically, performance and fault tolerance are placed into separate categories. For example, topics such as parallel processing, RISC, and compiler optimization fall into the performance category, while triple-modular redundancy, checkpointing, and forward recovery fall into the fault tolerance category. In many cases, these two aspects of computer science are considered orthogonal [13]. Performability extends performance analysis to include fault tolerance by identifying the characteristics of computer applications that affect delivered performance in the presence of failures.

In this manuscript, distributed applications are studied that operate on an input set of values of a given size. The size of the input set may not be evident before execution begins, but it is considered fixed at the instant execution starts. Since the input size and the processing environment are fixed for a given execution, the expected computation time is also considered fixed. This computation time is split into checkpoint intervals of equal length, and a checkpoint is taken at the end of each interval. A rollback-and-recovery model [15] is assumed for recovering from process failure within an application.
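To make the checkpoint-interval setup concrete, the sketch below estimates expected completion time when a fixed amount of work is split into equal-length intervals with a checkpoint at the end of each. It is a minimal sketch using a standard first-order checkpointing approximation (exponentially distributed failures, fixed checkpoint and rollback costs), not the SSP equations themselves; the function and parameter names are illustrative.

    import math

    def expected_runtime(work, interval, ckpt_cost, mtbf, recovery=0.0):
        """Expected completion time for `work` seconds of computation,
        checkpointing after every `interval` seconds of useful work.

        Assumes exponentially distributed failures with mean time between
        failures `mtbf`, a fixed checkpoint overhead `ckpt_cost`, and a
        fixed `recovery` (rollback) cost after each failure. This is a
        generic first-order model, not the SSP formulation in this paper.
        """
        lam = 1.0 / mtbf              # failure rate
        segments = work / interval    # number of checkpoint intervals
        # Expected wall-clock time to complete one segment (interval plus
        # checkpoint), restarting from the last checkpoint after a failure.
        seg_time = (1.0 / lam + recovery) * \
                   (math.exp(lam * (interval + ckpt_cost)) - 1.0)
        return segments * seg_time

    # Example: 10 h of work, 30 min intervals, 60 s checkpoints, 24 h MTBF.
    print(expected_runtime(36000, 1800, 60, 86400, recovery=120))

Under these assumed parameters the estimate (about 37,650 s) exceeds the failure-free time of 37,200 s (work plus checkpoint overhead), illustrating how the interval length trades checkpoint overhead against expected rework after failures.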
REFERENCES

[1] Nicholas Carriero, et al. How to write parallel programs: a guide to the perplexed, 1989, CSUR.
[2] Edith Schonberg, et al. Factoring: a method for scheduling parallel loops, 1992.
[3] J. P. Dougherty, et al. Monte Carlo integration in a distributed heterogeneous environment, 1993, Proceedings of the Twenty-Sixth Hawaii International Conference on System Sciences.
[4] Terry Williams, et al. Probability and Statistics with Reliability, Queueing and Computer Science Applications, 1983.
[5] Pankaj Jalote, et al. Fault tolerance in distributed systems, 1994.
[6] Ramesh Subramonian, et al. LogP: towards a realistic model of parallel computation, 1993, PPOPP '93.
[7] Shing-Chi Cheung, et al. Tractable Dataflow Analysis for Distributed Systems, 1994, IEEE Trans. Software Eng.
[8] Theodore Gyle Lewis. Foundations of parallel programming - a machine-independent approach, 1994.
[9] Nicholas Carriero, et al. Adaptive Parallelism and Piranha, 1995, Computer.
[10] Jack J. Dongarra, et al. Solving Computational Grand Challenges Using a Network of Heterogeneous Supercomputers, 1991, PPSC.
[11] Kishor S. Trivedi, et al. Dependability and Performability Analysis, 1993, Performance/SIGMETRICS Tutorials.
[12] Barry J. Gleeson, et al. Fault Tolerance: Why Should I Pay for It?, 1994, Hardware and Software Architectures for Fault Tolerance.
[13] Constantine D. Polychronopoulos, et al. Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers, 1987, IEEE Transactions on Computers.
[14] Gregory R. Andrews, et al. Paradigms for process interaction in distributed programs, 1991, CSUR.