IBM z990 soft error detection and recovery

Soft errors in logic are becoming more significant in the design of computer systems because latches and combinatorial logic are increasingly sensitive and the number of transistors on a chip keeps growing. At the same time, users of computer systems continue to expect higher levels of system reliability. Therefore, the investment in hardware and firmware/software mitigation is likely to continue to rise. The IBM eServer z990 system is designed to detect and recover from myriad instances of soft and permanent errors. Error detection and recovery within the z990 processors and the "nest" chips are described with respect to system-level protection against soft errors.
