A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

Current on-chip block-centric memory hierarchies exploit access patterns at the fine-grain scale of small blocks. Several recently proposed techniques for coherence traffic reduction and prefetching suggest that further useful patterns emerge with a macroscopic, coarse-grain view. To exploit coarse-grain behavior, previous work extended conventional caches with additional coarse-grain tracking and management structures, considerably increasing overall cost and complexity. This paper demonstrates that, as multi-megabyte caches have become commonplace, coarse-grain tracking and management no longer need to be an afterthought. This functionality comes "for free" via RegionTracker, a dual-grain cache design that maintains block-level communication while directly supporting coarse-grain tracking and management. Compared to a conventional block-centric cache of the same data capacity, RegionTracker requires less area to achieve a nearly identical miss rate (within 1%). RegionTracker can be used as the building block for coarse-grain optimizations, reducing their overall cost and easing their adoption. Using full-system simulation of a quad-core chip multiprocessor, commercial workloads, and area estimates based on full-custom layouts in a 130 nm commercial technology, we demonstrate the performance and cost viability of the RegionTracker design. We also demonstrate the potential of RegionTracker as a framework for coarse-grain optimizations by showing that it boosts the benefits and reduces the cost of a previously proposed snoop reduction technique.
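To make the dual-grain idea concrete, the following is a minimal software sketch of a structure that answers both block-level and region-level queries from the same bookkeeping. It is an illustration only, not the RegionTracker hardware organization: the class name `RegionView`, its methods, and the 64-byte block and 1 KB region sizes are all assumptions chosen for the example.

```python
from collections import defaultdict

# Assumed sizes for illustration only; the abstract does not specify them.
BLOCK_BYTES = 64
REGION_BYTES = 1024  # 16 blocks per region under these assumptions


def block_number(addr: int) -> int:
    """Index of the block containing the byte address."""
    return addr // BLOCK_BYTES


def region_number(addr: int) -> int:
    """Index of the region containing the byte address."""
    return addr // REGION_BYTES


class RegionView:
    """Hypothetical toy model: cached blocks grouped by their region."""

    def __init__(self) -> None:
        self._blocks_by_region = defaultdict(set)

    def insert(self, addr: int) -> None:
        """Record that the block containing addr is cached."""
        self._blocks_by_region[region_number(addr)].add(block_number(addr))

    def evict(self, addr: int) -> None:
        """Forget the block containing addr; drop the region once it is empty."""
        region = region_number(addr)
        blocks = self._blocks_by_region.get(region)
        if blocks is None:
            return
        blocks.discard(block_number(addr))
        if not blocks:
            del self._blocks_by_region[region]

    def has_block(self, addr: int) -> bool:
        """Fine-grain query: is this exact block cached?"""
        blocks = self._blocks_by_region.get(region_number(addr), ())
        return block_number(addr) in blocks

    def region_cached(self, addr: int) -> bool:
        """Coarse-grain query: is any block of this region cached?"""
        return region_number(addr) in self._blocks_by_region

    def blocks_in_region(self, addr: int) -> list:
        """Coarse-grain query: which blocks of this region are cached?"""
        return sorted(self._blocks_by_region.get(region_number(addr), ()))
```

A coarse-grain optimization such as snoop filtering would consult a query like `region_cached`, while ordinary cache accesses would use `has_block`; in this sketch both are served by one structure rather than by a separate tracking table added alongside the cache.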
