Application of Petri net models for the evaluation of fault-tolerant techniques in distributed systems

Analytical models based on Petri nets are presented for fault-tolerant schemes used in distributed systems. The models support the quantitative evaluation of fault-tolerant schemes and the selection of a suitable scheme for a specific system configuration. Several fault-tolerant schemes that can be modeled using Petri nets are discussed in detail: rollback recovery with checkpointing, recovery blocks, N-version programming, and conversations. After a brief review of Petri net models, their extension to incorporate fault-tolerant schemes is considered. A methodology for evaluating a fault-tolerant scheme for a specific system configuration is described, together with the steps involved in building a Petri net model of a fault-tolerant system. The subnet primitives used to build these models are identified, and an algorithm for building the models automatically is described. Examples illustrating the extended Petri net model are discussed, and numerical results are presented to show the applicability of the models.
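As a rough illustration of the kind of model the abstract refers to, the sketch below encodes a tiny untimed Petri net for rollback recovery with checkpointing and enumerates its reachable markings. It is not the paper's model: the place names (`running`, `segment_done`, `failed`), the transition names, the unit arc weights, and the helper functions are all assumptions made for this example, and the timing and probabilities needed for quantitative evaluation are left out.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Marking = Dict[str, int]

# Hypothetical net: each transition maps to (input places, output places).
# Every arc has weight 1; names are illustrative, not taken from the paper.
TRANSITIONS: Dict[str, Tuple[List[str], List[str]]] = {
    "compute": (["running"], ["segment_done"]),     # finish a work segment
    "checkpoint": (["segment_done"], ["running"]),  # save state, resume work
    "fail": (["running"], ["failed"]),              # a fault is detected
    "rollback": (["failed"], ["running"]),          # restore last checkpoint
}


def enabled(marking: Marking, name: str) -> bool:
    """A transition is enabled when every input place holds a token."""
    inputs, _ = TRANSITIONS[name]
    return all(marking.get(place, 0) >= 1 for place in inputs)


def fire(marking: Marking, name: str) -> Marking:
    """Return the marking after firing: consume inputs, produce outputs."""
    if not enabled(marking, name):
        raise ValueError(f"transition {name!r} is not enabled")
    inputs, outputs = TRANSITIONS[name]
    new_marking = dict(marking)
    for place in inputs:
        new_marking[place] -= 1
    for place in outputs:
        new_marking[place] = new_marking.get(place, 0) + 1
    return new_marking


def reachable(initial: Marking) -> Set[FrozenSet[Tuple[str, int]]]:
    """Enumerate reachable markings (terminates because this net is bounded)."""
    def key(m: Marking) -> FrozenSet[Tuple[str, int]]:
        # Ignore empty places so equal markings compare equal.
        return frozenset((p, n) for p, n in m.items() if n > 0)

    seen = {key(initial)}
    frontier = [initial]
    while frontier:
        current = frontier.pop()
        for name in TRANSITIONS:
            if enabled(current, name):
                nxt = fire(current, name)
                if key(nxt) not in seen:
                    seen.add(key(nxt))
                    frontier.append(nxt)
    return seen
```

Starting from a single token in `running`, the reachable markings correspond to the process computing, waiting to checkpoint, or recovering from a fault. A stochastic extension would attach firing rates to the transitions, which this sketch does not attempt.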
