Fingerprinting: bounding soft-error-detection latency and bandwidth

Recent studies suggest that the soft-error rate in microprocessor logic is likely to become a serious reliability concern by 2010. Detecting soft errors in the processor's core logic presents a new challenge beyond what error-detecting and error-correcting codes can handle. Commercial microprocessor systems that require an assurance of reliability employ an error-detection scheme based on some form of dual modular redundancy (DMR), ranging from replicated pipelines within the same die to mirroring of complete processors. To detect errors across a distributed DMR pair, we develop fingerprinting, a technique that summarizes a processor's execution history into a cryptographic signature, or "fingerprint". More specifically, a fingerprint is a hash value computed on the changes to a processor's architectural state resulting from a program's execution. The processors in a dual modular redundant pair periodically exchange and compare fingerprints to corroborate each other's correctness. Relative to other techniques, fingerprinting offers superior error coverage and significantly reduces the error-detection latency and bandwidth.
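
The exchange-and-compare idea described above can be illustrated with a minimal Python sketch. It is not the hardware design from the paper: the class `Fingerprint`, the function `first_mismatch`, the update tuple layout `(kind, location, value)`, the use of truncated SHA-256 as the hash, and the comparison interval are all assumptions made for illustration.

```python
import hashlib
import struct


class Fingerprint:
    """Running hash over a stream of architectural state updates (illustrative)."""

    def __init__(self, bits=16):
        # bits: width of the emitted fingerprint, at most 64 (assumed parameter).
        self.bits = bits
        self._h = hashlib.sha256()

    def update(self, kind, location, value):
        # kind: 0 for a register write, 1 for a memory write (assumed encoding).
        # location: register number or address; value: the new 64-bit value.
        self._h.update(struct.pack("<BQQ", kind, location, value))

    def emit(self):
        """Return the truncated fingerprint and start a new interval."""
        full = int.from_bytes(self._h.digest()[:8], "little")
        self._h = hashlib.sha256()
        return full & ((1 << self.bits) - 1)


def first_mismatch(updates_a, updates_b, interval, bits=16):
    """Compare two processors' update streams one interval at a time.

    Returns the index of the first interval whose fingerprints differ,
    or None if every compared interval matches.
    """
    fp_a, fp_b = Fingerprint(bits), Fingerprint(bits)
    pending = 0      # updates hashed since the last comparison
    compared = 0     # number of intervals compared so far
    for ua, ub in zip(updates_a, updates_b):
        fp_a.update(*ua)
        fp_b.update(*ub)
        pending += 1
        if pending == interval:
            if fp_a.emit() != fp_b.emit():
                return compared
            compared += 1
            pending = 0
    # Compare the trailing partial interval, if any.
    if pending and fp_a.emit() != fp_b.emit():
        return compared
    return None


if __name__ == "__main__":
    golden = [(0, r, r * 3) for r in range(8)]
    faulty = list(golden)
    faulty[5] = (0, 5, (5 * 3) ^ 1)  # single bit flip in one register write
    print(first_mismatch(golden, golden, interval=4, bits=64))
    print(first_mismatch(golden, faulty, interval=4, bits=64))
```

In this sketch each processor sends only `bits` bits per interval instead of every state update, which is where the bandwidth saving comes from, and a divergence is noticed no later than the end of the interval in which it occurs. A narrower fingerprint lowers bandwidth further but raises the chance that two different update streams hash to the same value.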
