Dynamic verification of end-to-end multiprocessor invariants

As implementations of shared-memory multiprocessors become more complicated, hardware faults will increasingly cause errors that are difficult or impossible to detect with low-level, localized mechanisms. In this paper, we argue for dynamic verification (i.e., on-the-fly checking) of end-to-end, system-wide invariants in shared-memory multiprocessors. We develop two invariant checkers based on distributed signature analysis. Our coherence-level checker dynamically verifies that every cache coherence upgrade has a corresponding downgrade elsewhere in the system. Our message-level checker verifies that all nodes in an SMP observe the same total order of broadcast requests. We use full-system simulation to show that the checkers detect the targeted errors without significantly degrading system performance.
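
To make the two invariants concrete, the following is a minimal software sketch of how distributed signatures could check them. It is an illustration under stated assumptions, not the hardware design evaluated in the paper: the 64-bit signature width, the SHA-256-based hash, the `seq` field that pairs an upgrade with its downgrade, and all class and function names are hypothetical. For the coherence-level invariant, each node adds a hash for every upgrade and subtracts one for every downgrade, so the signatures of all nodes sum to zero when every upgrade is matched. For the message-level invariant, each node folds the requests it observes into an order-sensitive signature, so all nodes end with the same value only if they saw the same total order.

```python
import hashlib

MASK = (1 << 64) - 1  # assumed signature width of 64 bits


def _h(*fields):
    """Hash the given event fields to a 64-bit integer (illustrative choice)."""
    data = "|".join(str(f) for f in fields).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class CoherenceSignature:
    """Per-node accumulator: upgrades add a hash, downgrades subtract it."""

    def __init__(self):
        self.sig = 0

    def upgrade(self, addr, seq):
        # seq is a hypothetical tag shared by an upgrade and its downgrade
        self.sig = (self.sig + _h(addr, seq)) & MASK

    def downgrade(self, addr, seq):
        self.sig = (self.sig - _h(addr, seq)) & MASK


def coherence_ok(nodes):
    """True if the node signatures cancel, i.e. upgrades match downgrades."""
    return sum(n.sig for n in nodes) & MASK == 0


class OrderSignature:
    """Per-node, order-sensitive fold of the broadcast requests observed."""

    def __init__(self):
        self.sig = 0

    def observe(self, request_id):
        # Chaining the previous signature makes the result depend on order
        self.sig = _h(self.sig, request_id)


def order_ok(nodes):
    """True if every node ended with the same order signature."""
    return len({n.sig for n in nodes}) == 1
```

In this sketch, a check compares only the small per-node signatures rather than the full event histories. As with any signature scheme, a violation could in principle go undetected if two different histories hash to the same value; the chance of such aliasing depends on the signature width and hash chosen.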
