An Online Mechanism to Verify Datapath Execution Using Existing Resources in Chip Multiprocessors

As process technology scales, transistor and interconnect reliability has become a growing concern for modern microprocessors. Traditional solutions for reliable operation rely on double or triple modular redundancy. Chip multiprocessors (CMPs), however, provide a unique opportunity for low-cost datapath verification. A recent paper presents a fault recovery scheme that outsources instructions from identified faulty cores to fault-free cores capable of executing them, with communication between the cores managed via an inter-core queue (ICQ). However, that work did not present a mechanism for identifying faulty cores. In this paper, we extend this research to enable self-test of datapath execution in a multicore processor. Specifically, whenever instructions are retired on a core (the local core), they are also dispatched via the ICQ to a nearby (remote) core for execution verification. The results obtained from the local and remote cores are compared. If a fault is detected, the instruction may be re-executed on both the local and remote cores to distinguish between hard and soft faults. In this study, we present results on the frequency of coverage and on the latency between an instruction's first execution and its verification. We also report the performance impact of execution verification on the remote core. Results indicate that the proposed scheme can remotely verify ~80% of integer ALU instructions and >98% of other instruction types, with a performance impact of only ~1% on the tester core and an area overhead of less than 1%.
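
The verification flow described above can be illustrated with a minimal software sketch. This is not the paper's hardware design: the `Core`, `Instr`, and `verify` names, the `stuck_bit` and `flip_once` fault models, and the rule that a mismatch which persists after re-execution is reported as a hard fault are all assumptions made for illustration. The inter-core queue is modeled as a simple FIFO.

```python
from collections import deque
from dataclasses import dataclass
import operator

# Hypothetical three-operation integer ALU, for illustration only.
OPS = {"add": operator.add, "sub": operator.sub, "and": operator.and_}


@dataclass(frozen=True)
class Instr:
    op: str
    a: int
    b: int


class Core:
    """Toy core model (an assumption, not the paper's microarchitecture).

    stuck_bit: permanent fault that forces one result bit to 1.
    flip_once: transient fault that flips one result bit a single time.
    """

    def __init__(self, stuck_bit=None, flip_once=None):
        self.stuck_bit = stuck_bit
        self.flip_once = flip_once

    def execute(self, instr):
        result = OPS[instr.op](instr.a, instr.b)
        if self.stuck_bit is not None:
            result |= 1 << self.stuck_bit
        if self.flip_once is not None:
            result ^= 1 << self.flip_once
            self.flip_once = None  # a transient fault does not recur
        return result


def verify(local, remote, program):
    """Retire on the local core, then check each result on the remote core."""
    icq = deque()  # inter-core queue, modeled as a FIFO
    for instr in program:
        icq.append((instr, local.execute(instr)))

    report = []
    while icq:
        instr, local_result = icq.popleft()
        if remote.execute(instr) == local_result:
            report.append("ok")
        elif local.execute(instr) == remote.execute(instr):
            # The mismatch disappeared on re-execution: treat as a soft fault.
            report.append("soft")
        else:
            # The mismatch persists on re-execution: treat as a hard fault.
            report.append("hard")
    return report


program = [Instr("add", 1, 2), Instr("sub", 5, 3)]
print(verify(Core(), Core(), program))
print(verify(Core(flip_once=0), Core(), program))
print(verify(Core(stuck_bit=3), Core(), program))
```

In this sketch, a mismatch only shows that the two cores disagree. Deciding which core is faulty would need additional information that the abstract does not specify, so the sketch leaves it out.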
