A Framework for System Reliability Analysis Considering Both System Error Tolerance and Component Test Quality

The failure rate, the sources of failures, and the test costs for nanometer devices are all increasing. Creating a reliable system-on-a-chip device therefore requires designers to implement fault tolerance. However, while system-level fault tolerance can significantly relax the quality requirements of the system's building blocks, every fault-tolerant scheme works only under certain failure mechanisms and within a certain range of error probabilities. Moreover, designing a system around components with high failure rates can be very expensive, because design complexity and the system overhead for fault tolerance can grow significantly faster than the component failure rate. It is therefore desirable to understand the trade-offs between component test quality and system fault-tolerance capability in achieving the desired reliability under cost constraints. In this paper, we propose an analysis framework for system reliability that considers (a) the test quality achieved by manufacturing testing, on-line self-checking, and off-line built-in self-test; (b) the fault-tolerant and spare schemes; and (c) the component defect and error probabilities. We demonstrate that, through proper redundancy configurations and low-cost testing to ensure a certain degree of component test quality, a low-redundancy system might achieve equal or higher reliability than a high-redundancy system.
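The closing claim of the abstract can be illustrated with a small numerical sketch. This is not the framework proposed in the paper; it is a textbook-style approximation under assumptions introduced here: modules fail independently, the majority voter is perfect, and the probability that a shipped component is defect-free follows the Williams-Brown defect-level model, DL = 1 - Y^(1 - T), where Y is process yield and T is test coverage. The function names and the numeric values (yield 0.8, coverage 0.99 versus 0.5) are hypothetical.

```python
from math import comb


def shipped_good_probability(process_yield: float, test_coverage: float) -> float:
    """Probability that a component passing test is defect-free.

    Assumes the Williams-Brown model: DL = 1 - Y**(1 - T),
    so the probability of a good shipped part is Y**(1 - T).
    """
    return process_yield ** (1.0 - test_coverage)


def majority_reliability(n: int, p: float) -> float:
    """Probability that a strict majority of n modules work.

    Assumes independent modules, each working with probability p,
    and a perfect voter (both are simplifying assumptions).
    """
    need = n // 2 + 1  # smallest number of working modules that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Hypothetical comparison: well-tested parts in a triple-redundant system
# versus poorly tested parts in a five-way redundant system.
p_well_tested = shipped_good_probability(process_yield=0.8, test_coverage=0.99)
p_poorly_tested = shipped_good_probability(process_yield=0.8, test_coverage=0.5)

tmr_well_tested = majority_reliability(3, p_well_tested)
fivemr_poorly_tested = majority_reliability(5, p_poorly_tested)
```

Under these assumed numbers, the triple-redundant system built from well-tested components comes out at roughly 0.99998, while the five-way redundant system built from poorly tested components reaches only about 0.990. The sketch ignores voter failures, correlated defects, run-time errors, and spare schemes, all of which a full analysis would need to account for.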
