Validation of the fault/error handling mechanisms of the Teraflops supercomputer

The Teraflops system, the world's most powerful supercomputer, was developed by Intel Corporation for the US Department of Energy (DOE) as part of the Accelerated Strategic Computing Initiative (ASCI). The machine contains more than 9000 Intel Pentium® Pro processors and performs over one trillion floating-point operations per second. Complex hardware and software mechanisms were devised to comply with DOE's reliability requirements. This paper gives a brief description of the Teraflops system architecture and presents the validation of its fault/error handling mechanisms. The validation process was based on an enhanced version of physical fault injection at the IC pin level. An original approach was developed for assessing signal sensitivity to transient faults and the effectiveness of the fault tolerance mechanisms. The fault injection experiments uncovered several malfunctions. After corrective actions had been taken, the supercomputer performed according to the specification.
