Designing for Ultrahigh Availability: The Unix RTR Operating System

Early designers of highly available computers concentrated on recovery from hardware failures to keep downtime to a minimum. But as hardware became more reliable, and systems (particularly software) became more complex, the percentage of downtime caused by hardware decreased. Achieving ultrahigh availability, on the order of a few minutes of downtime per year, requires far more than just reliable hardware. This can be seen from Table 1, which gives causes of downtime for both electronic switching [1-3] and transaction processing [4]. Although the numbers differ, in both cases the hardware accounts for less than half the downtime. The other causes include the following:

[1] M. R. Dubman, et al. 1A processor: Maintenance software, 1977, The Bell System Technical Journal.

[2] Robert L. Glass, et al. Software reliability guidebook, 1979.

[3] M. Sievers. Microprogrammed control and reliable design of small computers, 1982, Proceedings of the IEEE.

[4] W. N. Toy, et al. Fault-tolerant design of local ESS processors, 1978, Proceedings of the IEEE.