IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective

Fault tolerance in IBM S/390® systems during the 1980s and 1990s had three distinct phases, each characterized by a different uptime improvement rate. Early TCM-technology mainframes delivered excellent data integrity, instantaneous error detection, and positive fault isolation, but had limited on-line repair. Later TCM mainframes introduced capabilities for providing a high degree of transparent recovery, failure masking, and on-line repair. New challenges accompanied the introduction of CMOS technology. A significant reduction in parts count greatly improved intrinsic failure rates, but dense packaging disallowed on-line CPU repair. In addition, characteristics of the microprocessor technology posed difficulties for traditional in-line error checking. As a result, system fault-tolerant design, particularly in CPUs and memory, underwent another evolution from G1 to G5. G5 implements an innovative design for a high-performance, fault-tolerant single-chip microprocessor. Dynamic CPU sparing delivers a transparent concurrent repair mechanism. A new internal channel provides a high-performance, highly available Parallel Sysplex® in a single mainframe. G5 is both the culmination of decades of innovation and careful implementation, and the highest achievement of S/390 fault-tolerant design.
