Subspace snooping: Filtering snoops with operating system support

Although snoop-based coherence protocols provide fast cache-to-cache transfers with a simple and robust coherence mechanism, scaling the protocols has been difficult due to the overheads of broadcast snooping. In this paper, we propose a coherence filtering technique called subspace snooping, which stores the potential sharers of each memory page in the page table entry. By using the sharer information in the page table entry, coherence transactions for a page generate snoop requests only to the subset of nodes in the system (subspace). However, the coherence subspace of a page may evolve, as the phases of applications may change or the operating system may migrate threads to different nodes. To adjust subspaces dynamically, subspace snooping supports a shrinking mechanism, which removes obsolete nodes from subspaces. Subspace snooping can be integrated to any type of coherence protocols and network topologies. As subspace snooping guarantees that a subspace always contains the precise sharers of a page, it does not restrict the designs of coherence protocols and networks. We evaluate subspace snooping with Token Coherence on un-ordered mesh networks. For scientific and server applications on a 16-core system, subspace snooping reduces 44% of snoops on average.

[1]  Babak Falsafi,et al.  JETTY: filtering snoops for reduced energy consumption in SMP servers , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[2]  Faye A. Briggs,et al.  Scalability port: a coherent interface for shared memory multiprocessors , 2002, Proceedings 10th Symposium on High Performance Interconnects.

[3]  Milo M. K. Martin,et al.  Bandwidth adaptive snooping , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[4]  Rajeev Balasubramonian,et al.  Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[5]  Pat Conway,et al.  The AMD Opteron Northbridge Architecture , 2007, IEEE Micro.

[6]  Marcelo Cintra,et al.  An OS-based alternative to full hardware coherence on tiled CMPs , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[7]  Milo M. K. Martin,et al.  Token tenure: PATCHing token counting using directory-based cache coherence , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[8]  Niraj K. Jha,et al.  In-Network Coherence Filtering: Snoopy coherence without broadcasts , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9]  David A. Wood,et al.  Multicast snooping: a new coherence method using a multicast address network , 1999, ISCA.

[10]  A. Charlesworth The Sun Fireplane System Interconnect , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[11]  Milo M. K. Martin,et al.  Token Coherence: decoupling performance and correctness , 2003, 30th Annual International Symposium on Computer Architecture, 2003. Proceedings..

[12]  Natalie D. Enright Jerger,et al.  Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[13]  Niraj K. Jha,et al.  GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[14]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[15]  Niraj K. Jha,et al.  In-Network Snoop Ordering (INSO): Snoopy coherence on unordered interconnects , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[16]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[17]  Milo M. K. Martin,et al.  Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors , 2003, ISCA '03.

[18]  Mikko H. Lipasti,et al.  Improving multiprocessor performance with coarse-grain coherence tracking , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[19]  Per Stenström,et al.  TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors , 2002, ISLPED '02.

[20]  Babak Falsafi,et al.  Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[21]  References , 1971 .

[22]  Andreas Moshovos RegionScout: exploiting coarse grain sharing in snoop-based coherence , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[23]  Josep Torrellas,et al.  Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[24]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[25]  Mark D. Hill,et al.  Coherence Ordering for Ring-based Chip Multiprocessors , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[26]  Michel Cekleov,et al.  XDBus: a high-performance, consistent, packet-switched VLSI bus , 1993, Digest of Papers. Compcon Spring.

[27]  Patricia J. Teller Translation-lookaside buffer consistency , 1990, Computer.

[28]  Andreas Moshovos,et al.  RegionTracker : A Case for Dual-Grain Tracking in the Memory System , 2006 .

[29]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[30]  Li Shang,et al.  In-Network Cache Coherence , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[31]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.