To implement a distributed system with fail-soft capabilities, it is necessary to specify algorithms that redistribute the workload of a failed processor among the remaining good processors. This paper develops a general model for analyzing the behavior of these algorithms in a distributed system. Such algorithms should be used with caution, as they are capable of making the entire system unstable. By unstable we mean that if a processor fails and its workload is redistributed, the increased workload directed toward the rest of the system could drive one or more of the processors into overload, resulting in a serious degradation of system performance. Using the general model, we have studied a class of load redistribution algorithms that use various techniques to redistribute workload. These techniques include buffering jobs arriving at the failed processor, transmitting only the jobs in the queue of the failed processor, and rerouting all jobs around the failed processor. For this class of algorithms we have derived closed-form expressions for the performance of the system as a function of job arrival rate, job service rate, processor failure rate, and processor service rate. In addition, we have defined a criterion which, if adhered to, will guarantee system stability in the event of failure.
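The notion of instability described above can be illustrated with a minimal sketch. It is not the model or the stability criterion derived in the paper. It assumes, purely for illustration, n identical processors, each receiving jobs at rate `arrival_rate` and serving them at rate `service_rate`, with the load of a failed processor rerouted evenly to the survivors. Under these assumptions a survivor is overloaded when its utilization reaches or exceeds 1. The function names and parameters are hypothetical.

```python
def utilization_after_failure(n, arrival_rate, service_rate, failed=1):
    """Per-processor utilization once the load of `failed` processors is
    spread evenly over the survivors (an illustrative assumption, not the
    paper's model)."""
    survivors = n - failed
    if survivors <= 0:
        raise ValueError("no surviving processors")
    total_load = n * arrival_rate  # total job arrival rate is unchanged by the failure
    return total_load / (survivors * service_rate)


def is_stable_after_failure(n, arrival_rate, service_rate, failed=1):
    """True if every survivor stays below full utilization after rerouting."""
    return utilization_after_failure(n, arrival_rate, service_rate, failed) < 1.0


if __name__ == "__main__":
    # Four processors, each with service rate 1.0 (hypothetical values).
    for rate in (0.7, 0.8):
        rho = utilization_after_failure(4, rate, 1.0)
        print(f"arrival rate {rate}: utilization {rho:.3f}, "
              f"stable={is_stable_after_failure(4, rate, 1.0)}")
```

With these hypothetical values, a system running at 70% utilization absorbs a single failure, while one running at 80% is pushed past saturation. This is the kind of overload the abstract warns about.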