Exploiting Coarse Grain Non-Shared Regions in Snoopy Coherent Multiprocessors

It has been shown that many requests miss in all remote nodes in shared-memory multiprocessors. We are motivated by the observation that this behavior extends to much coarser-grain areas of memory. We define a region to be a contiguous, aligned memory area whose size is a power of two, and we observe that many requests find that no other node caches a block in the same region, even for regions as large as 16K bytes (this phenomenon was already known for the special cases of a block or a page). We propose RegionScout, a family of simple filter mechanisms that dynamically detect most non-shared regions. A node with a RegionScout filter can determine in advance that a request will miss in all remote nodes. RegionScout filters are implemented as a layered extension over existing snoop-based coherence systems. They require no changes to existing coherence protocols or caches and impose no constraints on what can be cached simultaneously. Their operation is completely transparent to software and the operating system. RegionScout filters require little additional storage and a single additional global signal. These characteristics are made possible by using imprecise information about the regions cached in each node. Because they rely on dynamically collected information, RegionScout filters can adapt to changing sharing patterns. We present two applications of RegionScout. In the first, RegionScout is used to avoid broadcasts for non-shared regions, thus reducing bandwidth. In the second, RegionScout is used to avoid snoop-induced tag lookups, thus reducing energy.
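The abstract does not describe the filter's internal structures, so the following is only a minimal sketch of the idea under stated assumptions: each node keeps a small array of counters as its imprecise record of locally cached regions, plus a set of regions it has learned are not shared. The names `Node`, `request`, `counters`, and `non_shared`, the counter-array size, and the modulo hash are all hypothetical choices made for illustration; they are not taken from the paper. The value returned by `may_cache` stands in for the single additional global signal.

```python
REGION_BITS = 14  # assumption: 16K-byte regions, the largest size mentioned above


def region_of(addr: int) -> int:
    """Regions are aligned and power-of-two sized, so a shift yields the region id."""
    return addr >> REGION_BITS


class Node:
    """Hypothetical per-node filter state (illustrative, not the paper's design)."""

    def __init__(self, n_counters: int = 256):
        # Imprecise record: several regions may alias to one counter, which can
        # only cause a conservative "may be cached here" answer.
        self.counters = [0] * n_counters
        # Regions this node believes no other node caches.
        self.non_shared = set()

    def _slot(self, region: int) -> int:
        return region % len(self.counters)

    def may_cache(self, region: int) -> bool:
        return self.counters[self._slot(region)] > 0

    def fill(self, addr: int) -> None:
        self.counters[self._slot(region_of(addr))] += 1

    def evict(self, addr: int) -> None:
        self.counters[self._slot(region_of(addr))] -= 1


def request(requester: Node, others: list, addr: int) -> bool:
    """Service a miss; return True if a broadcast was needed, False if it was avoided."""
    region = region_of(addr)
    if region in requester.non_shared:
        requester.fill(addr)
        return False  # known non-shared region: skip the broadcast
    shared = False
    for node in others:
        # The requester is about to cache a block in this region, so any
        # remote "non-shared" belief about it is no longer valid.
        node.non_shared.discard(region)
        if node.may_cache(region):
            shared = True  # models asserting the global "region shared" signal
    if not shared:
        requester.non_shared.add(region)
    requester.fill(addr)
    return True
```

In this sketch the first request to a region always broadcasts, and later requests from the same node to that region avoid the broadcast until another node touches the region. Counter aliasing can make a non-shared region look shared, which costs a missed filtering opportunity but never skips a broadcast that was required.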
