Paper Information - A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy

Current on-chip block-centric memory hierarchies exploit access patterns at the fine-grain scale of small blocks. Several recently proposed techniques for coherence traffic reduction and prefetching suggest that further useful patterns emerge with a macroscopic, coarse-grain view. To exploit coarse-grain behavior, previous work extended conventional caches with additional coarse-grain tracking and management structures, considerably increasing overall cost and complexity. This paper demonstrates that, as multi-megabyte caches have become commonplace, coarse-grain tracking and management no longer needs to be an afterthought. This functionality comes "for free" via RegionTracker, a dual-grain cache design that maintains block-level communication while directly supporting coarse-grain tracking and management. Compared to a conventional block-centric cache of the same data capacity, RegionTracker requires less area to achieve a nearly identical miss rate (within 1%). RegionTracker can be used as the building block for coarse-grain optimizations, reducing their overall cost and easing their adoption. Using full-system simulation of a quad-core chip multiprocessor, commercial workloads, and area estimates based on full-custom layouts in a 130 nm commercial technology, we demonstrate the performance and cost viability of the RegionTracker design. We also demonstrate the potential of RegionTracker as a framework for coarse-grain optimizations by showing that it boosts the benefits and reduces the cost of a previously proposed snoop reduction technique.

Andreas Moshovos | Jason Zebchuk | Elham Safi
