论文信息 - A dual grain hit-miss detector for large Die-Stacked DRAM caches

A dual grain hit-miss detector for large Die-Stacked DRAM caches

Die-Stacked DRAM caches offer the promise of improved performance and reduced energy by capturing a larger fraction of an application's working set than on-die SRAM caches. However, given that their latency is only 50% lower than that of main memory, DRAM caches considerably increase latency for misses. They also incur a significant energy overhead for remote lookups in snoop-based multi-socket systems. Ideally, it would be possible to detect in advance that a request will miss in the DRAM cache and thus selectively bypass it. This work proposes a “dual grain filter” which successfully predicts whether an access is a hit or a miss in most cases. Experimental results with commercial and scientific workloads show that a 158KB dual-grain filter can correctly predict data block residency for 85% of all accesses to a 256MB DRAM cache. As a result, average off-die latency with our filter is within 8% of that possible with a perfectly accurate filter, which is impractical to implement.

[1] Gabriel H. Loh,et al. A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[2] Babak Falsafi,et al. JETTY: filtering snoops for reduced energy consumption in SMP servers , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[3] Glenn Reinman,et al. Just say no: benefits of early cache miss determination , 2003, The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings..

[4] Thomas F. Wenisch,et al. SimFlex: a fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture , 2004, PERV.

[5] Norman P. Jouppi,et al. CACTI 6.0: A Tool to Model Large Caches , 2009 .

[6] Hsien-Hsin S. Lee,et al. Way guard: a segmented counting bloom filter approach to reducing energy for set-associative caches , 2009, ISLPED.

[7] Andreas Moshovos,et al. RegionTracker : Using Dual-Grain Tracking for Energy Efficient Cache Lookup , 2003 .

[8] Gabriel H. Loh,et al. Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[9] Andreas Moshovos. RegionScout: exploiting coarse grain sharing in snoop-based coherence , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[10] Burton H. Bloom,et al. Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[11] Mark D. Hill,et al. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[12] Yan Solihin,et al. CHOP: Integrating DRAM Caches for CMP Server Platforms , 2011, IEEE Micro.

[13] Andreas Moshovos,et al. A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).