Implementing End-to-End Register Data-Flow Continuous Self-Test

While Moore's Law predicts the ability of the semiconductor industry to engineer smaller and more efficient transistors and circuits, it leaves serious issues unaddressed. One concern is the verification effort for modern computing systems, which has grown to dominate the cost of system design. In addition, technology scaling is leading to the phase-out of burn-in. As a result, the in-the-field error rate may increase due to both actual errors and latent defects. Whereas data can be protected with arithmetic codes, there is a lack of cost-effective mechanisms for control logic. This paper presents a lightweight microarchitectural mechanism that ensures that data consumed through registers are correct. The protected structures include the issue queue logic and its associated data (i.e., tags and control signals), input multiplexers, rename data, replay logic, register free-list and release logic, and register file logic. Our results show coverage of around 90 percent for the targeted structures at a cost in power and area of about four percent, with no impact on performance.
