Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors

Destination-set prediction can improve the latency/bandwidth tradeoff in shared-memory multiprocessors. The destination set is the collection of processors that receive a particular coherence request. Snooping protocols send requests to the maximal destination set (i.e., all processors), reducing latency for cache-to-cache misses at the expense of increased traffic. Directory protocols send requests to the minimal destination set, reducing bandwidth at the expense of an indirection through the directory for cache-to-cache misses. Recently proposed hybrid protocols trade-off latency and bandwidth by directly sending requests to a predicted destination set.This paper explores the destination-set predictor design space, focusing on a collection of important commercial workloads. First, we analyze the sharing behavior of these workloads. Second, we propose predictors that exploit the observed sharing behavior to target different points in the latency/bandwidth tradeoff. Third, we illustrate the effectiveness of destination-set predictors in the context of a multicast snooping protocol. For example, one of our predictors obtains almost 90% of the performance of snooping while using only 15% more bandwidth than a directory protocol (and less than half the bandwidth of snooping).

[1]  Min Xu,et al.  Evaluating Non-deterministic Multi-threaded Commercial Workloads , 2001 .

[2]  D.A. Wood,et al.  Reactive NUMA: A Design For Unifying S-COMA And CC-NUMA , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[3]  David A. Wood,et al.  Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[4]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[5]  Fredrik Dahlgren Boosting the performance of hybrid snooping cache protocols , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[6]  Laxmi N. Bhuyan,et al.  Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors , 1992, IEEE Trans. Parallel Distributed Syst..

[7]  Jim Nilsson,et al.  Improving performance of load-store sequences for transaction processing workloads on multiprocessors , 1999, Proceedings of the 1999 International Conference on Parallel Processing.

[8]  Anna R. Karlin,et al.  Competitive snoopy caching , 1986, 27th Annual Symposium on Foundations of Computer Science (sfcs 1986).

[9]  David J. Lilja,et al.  The Potential of Compile-Time Analysis to Adapt the Cache Coherence Enforcement Strategy to the Data Sharing Characteristics , 1995, IEEE Trans. Parallel Distributed Syst..

[10]  Anna R. Karlin,et al.  Two adaptive hybrid cache coherency protocols , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[11]  José González,et al.  Owner Prediction for Accelerating Cache-to-Cache Transfer Misses in a cc-NUMA Architecture , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[12]  BarrosoLuiz Andre,et al.  Memory system characterization of commercial workloads , 1998 .

[13]  Milo M. K. Martin,et al.  Token Coherence: decoupling performance and correctness , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[14]  Sarita V. Adve,et al.  Performance of database workloads on shared-memory systems with out-of-order processors , 1998, ASPLOS VIII.

[15]  Josep Torrellas,et al.  Distance-adaptive update protocols for scalable shared-memory multiprocessors , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[16]  Milo M. K. Martin,et al.  Bandwidth adaptive snooping , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[17]  Erik Hagersten,et al.  WildFire: a scalable path for SMPs , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[18]  Milo M. K. Martin,et al.  Simulating a $ 2 M Commercial Server on a $ 2 K PC T , 2001 .

[19]  Stefanos Kaxiras,et al.  Improving CC-NUMA performance using Instruction-based Prediction , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[20]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[21]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[22]  David A. Wood,et al.  Full-system timing-first simulation , 2002, SIGMETRICS '02.

[23]  Anoop Gupta,et al.  Cache Invalidation Patterns in Shared-Memory Multiprocessors , 1992, IEEE Trans. Computers.

[24]  K. Gharachorloo,et al.  Architecture and design of AlphaServer GS320 , 2000, ASPLOS IX.

[25]  Mark D. Hill,et al.  Using prediction to accelerate coherence protocols , 1998, ISCA.

[26]  Steven R. Kunkel,et al.  System optimization for OLTP workloads , 1999, IEEE Micro.

[27]  David A. Wood,et al.  Multicast snooping: a new coherence method using a multicast address network , 1999, ISCA.

[28]  Luiz André Barroso,et al.  Memory system characterization of commercial workloads , 1998, ISCA.

[29]  Milo M. K. Martin,et al.  Specifying and Verifying a Broadcast and a Multicast Snooping Cache Coherence Protocol , 2002, IEEE Trans. Parallel Distributed Syst..

[30]  Jim Nilsson,et al.  Reducing ownership overhead for load-store sequences in cache-coherent multiprocessors , 2000, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000.

[31]  Babak Falsafi,et al.  Selective, accurate, and timely self-invalidation using last-touch prediction , 2000, ISCA '00.

[32]  Sigarch Proceedings 30th Annual International Symposium on Computer Architecture , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[33]  Willy Zwaenepoel,et al.  Adaptive software cache management for distributed shared memory architectures , 1990, ISCA '90.

[34]  Mats Brorsson,et al.  An adaptive cache coherence protocol optimized for migratory sharing , 1993, ISCA '93.

[35]  Josep Torrellas,et al.  The memory performance of DSS commercial workloads in shared-memory multiprocessors , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[36]  Stefanos Kaxiras,et al.  Coherence communication prediction in shared-memory multiprocessors , 2000, Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550).

[37]  José González,et al.  The use of prediction for accelerating upgrade misses in cc-NUMA multiprocessors , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[38]  Babak Falsafi,et al.  Memory sharing predictor: the key to a speculative coherent DSM , 1999, ISCA.

[39]  Robert J. Fowler,et al.  Adaptive cache coherency for detecting migratory shared data , 1993, ISCA '93.