Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors

Integrating more processor cores on-die has become the prevailing trend in the microprocessor industry. Most current research thrusts use chip multiprocessors (CMPs) as the baseline to analyze problems in various domains. One of the main design issues facing CMP systems is the growing number of snoops required to maintain cache coherency and to support self/cross-modifying code, which leads to power and performance limitations. In this paper, we analyze the internal and external snoop behavior in a CMP system and relax the snoopy cache coherence protocol based on the program semantics and the properties of the shared variables to save power. Based on these observations and analyses, we propose two novel techniques, Selective Snoop Probe (SSP) and Essential Snoop Probe (ESP), to reduce power without compromising performance. Our simulation results show that the SSP technique achieves 5% to 65% data cache energy savings per core across different processor configurations, with a 1% to 2% performance improvement. We also show that 5% to 82% of data cache energy per core is spent on non-essential snoop probes, which can be saved using the ESP technique.
