Unified Architectural Support for Soft-Error Protection or Software Bug Detection

In this paper we propose a unified architectural support that can be used flexibly for either soft-error protection or software bug detection. Our approach is based on dynamically detecting and enforcing instruction- level invariants. A hardware table is designed to keep track of run-time invariant information. During program execution, instructions access this table and compare their produced results against the stored invariants. Any violation of the predicted invariant suggests a potential abnormal behavior, which could be a result of a soft error or a latent software bug. In case of a soft error, monitoring invariant violations provides opportunistic soft-error protection to multiple structures in processor pipelines. Our experimental results show that invariant violations detect soft errors promptly and as a result, simple pipeline squashing is able to fix most of the detected soft errors. Meanwhile, the same approach can be easily adapted for software bug detection. The proposed architectural support eliminates the substantial performance overhead associated with software-based bug-detection approaches and enables continuous monitoring of production code.

[1]  Sudheendra Hangal,et al.  Tracking down software bugs using automatic anomaly detection , 2002, ICSE '02.

[2]  Mikko H. Lipasti,et al.  Value locality and load value prediction , 1996, ASPLOS VII.

[3]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[4]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[5]  Irith Pomeranz,et al.  Transient-fault recovery for chip multiprocessors , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[6]  Shubhendu S. Mukherjee,et al.  Perturbation-based Fault Screening , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[7]  Sanjay J. Patel,et al.  ReStore: symptom based soft error detection in microprocessors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[8]  James E. Smith,et al.  The predictability of data values , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[9]  T. N. Vijaykumar,et al.  Opportunistic transient-fault detection , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[10]  G.S. Sohi,et al.  Dynamic instruction reuse , 1997, ISCA '97.

[11]  Evan Marcus,et al.  Blueprints for high availability , 2000 .

[12]  William G. Griswold,et al.  Dynamically discovering likely program invariants to support program evolution , 1999, Proceedings of the 1999 International Conference on Software Engineering (IEEE Cat. No.99CB37002).

[13]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[14]  Monica S. Lam,et al.  Enhancing software reliability with speculative threads , 2002, ASPLOS X.

[15]  Yuanyuan Zhou,et al.  BugBench: Benchmarks for Evaluating Bug Detection Tools , 2005 .

[16]  Huiyang Zhou,et al.  Locality-Based Information Redundancy for Processor Reliability , 2007 .

[17]  William G. Griswold,et al.  Quickly detecting relevant program invariants , 2000, Proceedings of the 2000 International Conference on Software Engineering. ICSE 2000 the New Millennium.

[18]  Wei Liu,et al.  AccMon: Automatically Detecting Memory-Related Bugs via Program Counter-Based Invariants , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[19]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[20]  Joel S. Emer,et al.  Techniques to reduce the soft error rate of a high-performance microprocessor , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[21]  Evan Marcus,et al.  Blueprints for high availability: designing resilient distributed systems , 2000 .