Clouseau: Probabilistic Dynamic Verification of Multithreaded Memory Systems

Dynamic verification improves a system’s availability by checking, while the system is running, that its execution is correct. While high performance and low power are desirable, correctness, despite hardware faults and subtle design bugs, is most important. For multithreaded systems, memory system correctness is defined by the memory consistency model. Thus, dynamically verifying memory consistency would ensure that the entire memory system is operating correctly. We present the first implementable design for probabilistic dynamic verification of sequential consistency (pDVSC) in multithreaded systems. The system dynamically creates a total order of memory operations (loads and stores) and verifies that this total order obeys SC. In the theoretical world of systems without resource constraints, DVSC would have to consider the entire total order, but we show how to leverage resource constraints to verify only a sliding window of the total order. While we cannot bound the size of this window and still eliminate all false verifications (false positives or false negatives), we can implement probabilistic verification and make the probability of false verification arbitrarily small. We use full-system simulation of a multithreaded system running commercial workloads to evaluate our first implementation of pDVSC, called Clouseau. Clouseau’s implementation costs are kept reasonable through extensive compression and caching of the data used for dynamic verification. Combined with backward error recovery, Clouseau improves availability by recovering from injected errors, and it adds only negligible performance overhead. While Clouseau adds to system design complexity, we believe this is a small price to pay for improving system availability.
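To make the notion of a total order that "obeys SC" concrete, the following is a minimal software sketch of the check, not Clouseau's hardware design. It assumes the total order is already available as a list and verifies the two textbook conditions for sequential consistency: each thread's operations appear in program order, and each load returns the value of the most recent store to the same address in the total order. The names `MemOp` and `obeys_sc`, the record fields, and the initial memory value of 0 are all illustrative assumptions. The sketch scans the entire order; it does not model the sliding window or the probabilistic aspects described above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MemOp:
    """One memory operation (hypothetical record layout for illustration)."""
    thread: int  # id of the issuing thread
    seq: int     # position in that thread's program order
    kind: str    # "load" or "store"
    addr: int    # memory address accessed
    value: int   # value written (store) or value observed (load)


def obeys_sc(total_order: List[MemOp], initial: int = 0) -> bool:
    """Return True if total_order is a valid sequentially consistent order.

    Assumes every address holds `initial` before its first store.
    """
    last_seq: Dict[int, int] = {}  # last program-order index seen per thread
    memory: Dict[int, int] = {}    # most recent stored value per address

    for op in total_order:
        # Condition 1: the total order must respect each thread's program order.
        if op.thread in last_seq and op.seq <= last_seq[op.thread]:
            return False
        last_seq[op.thread] = op.seq

        if op.kind == "store":
            memory[op.addr] = op.value
        elif op.kind == "load":
            # Condition 2: a load sees the latest store to the same address.
            if memory.get(op.addr, initial) != op.value:
                return False
        else:
            raise ValueError(f"unknown operation kind: {op.kind!r}")
    return True
```

For example, an order in which thread 0 stores 1 to an address and thread 1 then loads 1 from it passes the check, whereas an order in which that load observes the stale value 0 after the store is rejected.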
