Aggressive snoop reduction for synchronized producer-consumer communication in energy-efficient embedded multi-processors

Snoop-based cache coherence protocols are typically used when multiple processor cores share memory through a common bus. It is well known, however, that these protocols introduce excessive power overhead. To alleviate this problem, we propose an application-driven customization technique in which application knowledge about data sharing in producer-consumer relationships is used to aggressively eliminate unnecessary and predictable snoop-induced cache tag lookups, even for references to shared data, thus achieving significant power reduction at minimal hardware cost. Snoop-induced cache tag lookups for accesses to both shared and private data are eliminated whenever it is guaranteed that such lookups would yield no additional knowledge about the cache state with respect to the other caches and memories. The proposed methodology relies on combined support from the compiler, the operating system, and the hardware architecture. Our experiments show average power reductions of more than 80% compared to a general-purpose snoop protocol.
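The filtering idea can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the hardware mechanism evaluated in the paper: the names `Region`, `SnoopFilter`, `needs_tag_lookup`, and `hand_off` are hypothetical, and the model assumes that the compiler and operating system have registered the address ranges of private data and producer-consumer buffers, and that at each synchronization point the previous owner's cached copies of a buffer have been written back and invalidated.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical descriptor of an address range known to the filter."""
    start: int    # first byte address of the range
    end: int      # one past the last byte address
    owner: int    # core that currently has exclusive use of the range
    shared: bool  # True for a producer-consumer buffer, False for private data

    def contains(self, addr: int) -> bool:
        return self.start <= addr < self.end


class SnoopFilter:
    """Per-core filter that decides whether a snooped bus request
    needs a tag lookup in this core's cache (behavioral model only)."""

    def __init__(self, core_id: int, regions: list):
        self.core_id = core_id
        self.regions = regions
        self.lookups = 0   # tag lookups actually performed
        self.skipped = 0   # tag lookups eliminated

    def needs_tag_lookup(self, addr: int, requester: int) -> bool:
        for region in self.regions:
            if region.contains(addr):
                # Assumption: while a core owns a registered range, no other
                # cache holds a copy, so the lookup outcome is predictable.
                if region.owner == requester:
                    self.skipped += 1
                    return False
                break
        # Unknown address or non-owner access: fall back to a normal snoop.
        self.lookups += 1
        return True


def hand_off(region: Region, new_owner: int) -> None:
    """Model a synchronization point where a buffer passes from the
    producer to the consumer (or back). Assumes the old owner has
    already written back and invalidated its cached lines."""
    if not region.shared:
        raise ValueError("private regions never change owner")
    region.owner = new_owner
```

In this model, a consumer core's filter skips the lookup for every bus request the producer issues to the buffer during the production phase, and the roles swap after `hand_off`. Requests to addresses outside any registered range are snooped as usual, which keeps the sketch conservative.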
