[2009] A Stage-Level Recovery Scheme in Scalable Pipeline Modules for High Dependability

In the recent years, the increasing error rate has become one of the major impediments for the application of new process technologies in electronic devices like microprocessors. This thereby necessitates the research of fault toleration mechanisms from all device, micro-architecture and system levels to keep correct computation in future microprocessors, along the advances of process technologies.Space redundancy, as dual or triple modular redundancy (DMR or TMR), is widely used to tolerate errors with a negligible performance loss. In this paper, at the micro-architecture level, we propose a very fine-grained recovery scheme based on a DMR processor architecture to cover every transient error inside of the memory interface boundary. Our recovery method makes full use of the existing duplicated hardware in the DMR processor, which can avoid large hardware extension by not using checkpoint buffers in many fault-tolerable processors. The hardware-based recovery is achieved by dynamically triggering an instruction re-execution procedure in the next cycle after error detection, which indicates a near-zero performance impact to achieve an error-free execution.A TMR architecture is usually preferred as it provides a seamless error correction by a majority voting logic and therefore generates no recovery delay. With our fast recovery scheme at a low hardware cost, our result shows that even under a relatively high transient error rate, it is possible to only use a DMR architecture to detect/recover errors at a negligible performance cost. Our reliable processor is thus constructed to use a DMR execution with the fast recovery as its major working mode. It saves around 1/3 energy consumption from a traditional TMR architecture, while the transient error coverage is still maintained.

[1]  N. Seifert,et al.  Robust system design with built-in soft-error resilience , 2005, Computer.

[2]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[3]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[4]  Babak Falsafi,et al.  Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[5]  C. L. Chen,et al.  APPENDIX A – Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review , 1992 .

[6]  Irith Pomeranz,et al.  Transient-fault recovery for chip multiprocessors , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[7]  Edward J. McCluskey,et al.  Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..

[8]  Irith Pomeranz,et al.  Transient-Fault Recovery for Chip Multiprocessors , 2003, IEEE Micro.

[9]  Daniel P. Siewiorek,et al.  Reliable Computer Systems: Design and Evaluation, Third Edition , 1998 .

[10]  P.N. Sanda,et al.  IBM z990 soft error detection and recovery , 2005, IEEE Transactions on Device and Materials Reliability.

[11]  Chin-Long Chen,et al.  Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review , 1984, IBM J. Res. Dev..

[12]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[13]  Onur Mutlu,et al.  Microarchitecture-based introspection: a technique for transient-fault tolerance in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[14]  Sarita V. Adve,et al.  The impact of technology scaling on lifetime reliability , 2004, International Conference on Dependable Systems and Networks, 2004.

[15]  Xiaodong Li,et al.  SoftArch: an architecture-level tool for modeling and analyzing soft errors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[16]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.