Reliability Improvement through Redundancy at Various System Levels

Improvement in computing-machine reliability through redundancy is studied as a function of the level at which the redundancy is applied. The reliability achieved by redundancy of complete, independent machines is compared with that achieved by redundancy of smaller units. A machine unit is termed m times redundant when the following conditions hold: (1) m independent, identical units operate simultaneously with a common input; (2) a failure detector is associated with each unit; (3) a switch is connected to the outputs of the units, so that the output is taken from one unit until that unit fails, whereupon the switch steps to take the output from the next redundant unit, provided that unit is operating correctly. The process continues until the assigned task is completed or all m units fail. The reliability of m redundant units is expressed in terms of the reliability of one unit and the probabilities of correct operation of the failure detectors and the switch. It is assumed that a complete machine may be divided into p units of equal reliability, p = 1, 2, 3, ..., 24. The reliability achieved by redundancy of these units is calculated as a function of p and m, for m = 1, 2, 3, 4 and single-machine reliabilities of 0.2, 0.5, 0.9, and 0.99. These results are calculated for perfect failure-detection and switching devices as well as for moderately unreliable ones. The resulting system unreliability is plotted as a function of p on linear and logarithmic scales.
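
The calculation described above can be illustrated with a minimal sketch. The abstract does not state its formulas, so the model below is an assumption, not the author's method: the p units are taken to be in series, so that each has reliability R^(1/p) for single-machine reliability R; units fail independently; and each step of the switch succeeds with a single probability, here called coverage, which lumps together correct operation of the failure detector and of the switch. All function and parameter names are hypothetical.

```python
def redundant_unit_reliability(r, m, coverage=1.0):
    """Reliability of an m-times-redundant unit (assumed model).

    r        -- reliability of one unit over the assigned task
    m        -- number of identical units operating simultaneously
    coverage -- assumed probability that a failure is detected AND the
                switch steps correctly to the next unit (1.0 = perfect)
    """
    total = 0.0
    reach = 1.0  # probability that the output has been handed on to this unit
    for _ in range(m):
        total += reach * r               # this unit carries the task to completion
        reach *= (1.0 - r) * coverage    # it fails and the hand-over succeeds
    return total


def system_reliability(R, p, m, coverage=1.0):
    """Reliability of a machine split into p equal units, each m times redundant.

    Assumes the p units are in series, so one unit has reliability R ** (1 / p).
    """
    r = R ** (1.0 / p)
    return redundant_unit_reliability(r, m, coverage) ** p


if __name__ == "__main__":
    # System unreliability for the parameter ranges named in the abstract,
    # with perfect detection and switching (coverage = 1.0).
    for R in (0.2, 0.5, 0.9, 0.99):
        for m in (1, 2, 3, 4):
            row = [1.0 - system_reliability(R, p, m) for p in (1, 6, 12, 24)]
            print(R, m, ["%.3e" % q for q in row])
```

With coverage = 1.0 the unit reliability reduces to the familiar 1 - (1 - r)^m; values of coverage below 1.0 stand in for the "moderately unreliable" detection and switching devices, although the figures used in the paper are not given in the abstract.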