Reducing ownership overhead for load-store sequences in cache-coherent multiprocessors

Parallel programs that modify shared data in a cache-coherent multiprocessor with a write-invalidate coherence protocol incur ownership overhead in the form of ownership acquisitions at writes to shared data. This can have a significant impact on performance in a cache-coherent non-uniform memory access (NUMA) multiprocessor. By combining a read request and an ownership acquisition, the write latency and network traffic can potentially be reduced. In this paper we propose a new hardware-based approach for performing this optimization by targeting load-store sequences, which we show to be a superset of migratory sharing. A load-store sequence consists of a global read request followed by a global write action to the same memory location from the same processor, without any intervening access to the same block from any other processor. We use detailed simulation with four benchmark programs, including one on-line transaction processing (OLTP) workload and operating system execution, to examine the effectiveness of the proposed technique. The results show that the technique reduces write-related latency and network traffic more than previous hardware-based techniques, by up to a factor of two.
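
To make the detection condition concrete, here is a minimal software sketch, not the authors' hardware design: the table size, state encoding, and function names are assumptions for illustration. It tracks per-block access history and predicts when a processor's read miss should be issued as a combined read-plus-ownership (read-exclusive) request instead of a plain read request.

```c
/* Minimal sketch of a load-store-sequence detector (illustrative only;
 * structure sizes and names are assumptions, not the paper's design). */
#include <stdbool.h>
#include <stdint.h>

#define NUM_ENTRIES 1024          /* assumed size of the tracking table */

typedef enum { IDLE, READ_SEEN } SeqState;

typedef struct {
    SeqState state;               /* tracking state for this block */
    int      reader;              /* processor whose read opened the sequence */
    bool     predict_ls;          /* predict read->write: fetch exclusively */
} BlockEntry;

static BlockEntry table[NUM_ENTRIES];

/* Called on every global access to a block. Returns true when a read
 * miss should be issued as a combined read+ownership request. */
bool on_access(uint32_t block, int cpu, bool is_write)
{
    BlockEntry *e = &table[block % NUM_ENTRIES];

    if (!is_write) {              /* global read */
        bool upgrade = e->predict_ls;   /* act on the last prediction */
        e->state  = READ_SEEN;
        e->reader = cpu;          /* any later read overwrites this, so an
                                     intervening access breaks the sequence */
        return upgrade;
    }

    /* global write */
    if (e->state == READ_SEEN && e->reader == cpu) {
        /* read followed by a write from the same processor with no
         * intervening access from another one: a load-store sequence */
        e->predict_ls = true;
    } else {
        e->predict_ls = false;    /* pattern broken: revert to plain reads */
    }
    e->state = IDLE;
    return false;
}
```

For example, if processor 0 reads block 7 and then writes it with no other processor touching the block in between, the next read miss by any processor to block 7 is issued read-exclusive. A misprediction simply falls back to an ordinary read request on the following miss, so the cost of being wrong is bounded; this adaptive flavor resembles earlier migratory-sharing detectors while covering the broader load-store-sequence pattern.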
