Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures

To provide high dependability in a multithreaded system despite hardware faults, the system must detect and correct errors in its shared memory system. Recent research has explored dynamic checking of cache coherence as a comprehensive approach to memory system error detection. However, existing coherence checkers are costly to implement, incur high interconnection network traffic overhead, and do not scale well. In this paper, we describe the token coherence signature checker (TCSC), which provides comprehensive, low-cost, scalable coherence checking by maintaining signatures that represent recent histories of coherence events at all nodes (cache and memory controllers). Periodically, these signatures are sent to a verifier to determine whether an error occurred. TCSC has a small constant hardware cost per node, independent of cache and memory size and of the number of nodes. TCSC's interconnect bandwidth overhead has a constant upper bound and never exceeds 7% in our experiments. TCSC has negligible impact on system performance.
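The signature idea can be illustrated with a small sketch. It is not the signature function used by TCSC, which the abstract does not specify. It assumes a simplified model in which each node keeps one fixed-size accumulator, adds a hash term when it receives tokens and subtracts the same term when it sends them, so that token conservation makes all signatures sum to zero at the verifier. The names `Node`, `event_hash`, `verify`, and `run_interval`, the modulus, and the assumption that no messages are in flight when signatures are collected are all hypothetical.

```python
import hashlib

# Hypothetical modulus for the accumulator (a Mersenne prime); an assumption,
# not a parameter taken from the paper.
MOD = 2**61 - 1


def event_hash(addr: int, tokens: int) -> int:
    """Map a (block address, token count) coherence event to a value mod MOD."""
    digest = hashlib.sha256(str(addr).encode()).digest()
    base = int.from_bytes(digest[:8], "big") % MOD
    return (base * tokens) % MOD


class Node:
    """A cache or memory controller holding one constant-size signature."""

    def __init__(self) -> None:
        self.signature = 0

    def send(self, addr: int, tokens: int) -> None:
        # Tokens leaving this node are subtracted from its signature.
        self.signature = (self.signature - event_hash(addr, tokens)) % MOD

    def receive(self, addr: int, tokens: int) -> None:
        # Tokens arriving at this node are added to its signature.
        self.signature = (self.signature + event_hash(addr, tokens)) % MOD

    def report(self) -> int:
        # Hand the signature to the verifier and start a new interval.
        sig, self.signature = self.signature, 0
        return sig


def verify(signatures: list[int]) -> bool:
    """Under token conservation, sends and receives cancel across all nodes."""
    return sum(signatures) % MOD == 0


def run_interval(lose_token: bool = False) -> bool:
    """Memory sends 3 tokens for one block to a cache; optionally one is lost."""
    memory, cache = Node(), Node()
    memory.send(0x40, 3)
    cache.receive(0x40, 2 if lose_token else 3)
    return verify([memory.report(), cache.report()])


if __name__ == "__main__":
    print(run_interval())                 # fault-free interval
    print(run_interval(lose_token=True))  # a token was dropped in transit
```

In this simplified model the per-node state is a single integer regardless of cache size or node count, which mirrors the constant per-node cost claimed above. A real checker would also have to handle messages that straddle an interval boundary and hash collisions.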
