Hardware/Software Codesign Architecture for Online Testing in Chip Multiprocessors

As the semiconductor industry continues its relentless push toward nano-CMOS technologies, long-term device reliability and the occurrence of hard errors have emerged as major concerns. Long-term device reliability covers both parametric degradation, which results in loss of performance, and hard failures, which result in loss of functionality. The ITRS roadmap reports that the effectiveness of traditional burn-in testing in accelerating product life is eroding. Thus, to assure sufficient product reliability, fault detection and system reconfiguration must be performed in the field at runtime. Although regular memory structures are protected against hard errors using error-correcting codes, many structures within cores are left unprotected. Several proposed online testing techniques either rely on concurrent testing or periodically check for correctness. These techniques are attractive but limited by significant design effort and hardware cost. Furthermore, the lack of observability and controllability of microarchitectural state results in long latency, long test sequences, and large storage for golden patterns. In this paper, we propose a low-cost scheme for detecting and debugging hard errors at a fine granularity within cores while keeping the faulty cores functional, potentially with reduced capability and performance. The solution includes both hardware and runtime software based on the codesigned virtual machine concept. It can detect, debug, and isolate hard errors in small noncache array structures, execution units, and combinational logic within cores. Hardware signature registers capture the footprint of execution at the outputs of functional modules within the cores. A runtime software layer (the microvisor) initiates functional tests concurrently on multiple cores and captures the signature footprints across cores to detect, debug, and isolate hard errors.
Results show that, using a targeted set of functional test sequences, faults can be debugged to a fine-grained level within cores. The hardware cost of the scheme is less than three percent, while the software tasks are performed at a high level, resulting in relatively low design effort and cost.
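
To make the signature-based idea concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a 16-bit MISR-style signature register with an arbitrarily chosen feedback polynomial, two hypothetical functional modules (`adder`, `xor_unit`), a stuck-at-1 fault model on module outputs, and a majority vote across cores as the comparison rule. None of these details are specified in the abstract.

```python
from collections import Counter

WIDTH = 16
MASK = (1 << WIDTH) - 1
POLY = 0x1021  # feedback polynomial chosen for illustration only (assumption)


def misr_step(sig, value):
    """Fold one module output into a MISR-style signature register."""
    msb = (sig >> (WIDTH - 1)) & 1
    sig = (sig << 1) & MASK
    if msb:
        sig ^= POLY
    return sig ^ (value & MASK)


# Hypothetical functional modules whose outputs feed signature registers.
MODULES = {
    "adder": lambda a, b: (a + b) & MASK,
    "xor_unit": lambda a, b: (a ^ b) & MASK,
}


def run_functional_test(stuck_at_one, vectors):
    """Run the same test on one core and return {module: signature}.

    stuck_at_one maps a module name to a bit mask forced to 1 at its
    output (an assumed hard-fault model).
    """
    sigs = {}
    for name, fn in MODULES.items():
        sig = 0
        for a, b in vectors:
            out = fn(a, b) | stuck_at_one.get(name, 0)
            sig = misr_step(sig, out)
        sigs[name] = sig
    return sigs


def isolate(per_core_sigs):
    """Flag (core, module) pairs whose signature differs from the majority."""
    suspects = []
    for module in MODULES:
        votes = Counter(s[module] for s in per_core_sigs.values())
        majority, _ = votes.most_common(1)[0]
        for core, s in per_core_sigs.items():
            if s[module] != majority:
                suspects.append((core, module))
    return sorted(suspects)


# Four cores run the same test vectors; core 1 has a stuck-at-1 fault
# on bit 0 of its adder output.
vectors = [(1, 2), (2, 2), (3, 4)]
cores = {0: {}, 1: {"adder": 0x1}, 2: {}, 3: {}}
sigs = {core: run_functional_test(f, vectors) for core, f in cores.items()}
suspects = isolate(sigs)
print(suspects)
```

In this sketch only the adder on core 1 is flagged, because its signature disagrees with the three fault-free cores, while its `xor_unit` signature still matches. A majority vote needs at least three cores and assumes most of them are fault-free. Comparing signatures across cores, as the abstract describes, avoids storing golden patterns.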
