Commercial fault tolerance: a tale of two systems

This paper compares and contrasts the design philosophies and implementations of two computer system families: the IBM S/360 and its evolution to the current zSeries line, and the Tandem (now HP) NonStop® Server. Both systems have a long history; the initial IBM S/360 machines were shipped in 1964, and the Tandem NonStop System was first shipped in 1976. They were aimed at similar markets, what would today be called enterprise-class applications. The requirement for the original S/360 line was for very high availability; the requirement for the NonStop platform was for single fault tolerance against unplanned outages. Since their initial shipments, availability expectations for both platforms have continued to rise and the system designers and developers have been challenged to keep up. There were and still are many similarities in the design philosophies of the two lines, including the use of redundant components and extensive error checking. The primary difference is that the S/360-zSeries focus has been on localized retry and restore to keep processors functioning as long as possible, while the NonStop developers have based systems on a loosely coupled multiprocessor design that supports a "fail-fast" philosophy implemented through a combination of hardware and software, with workload being actively taken over by another resource when one fails.
