A Comprehensive Analytical Performance Model of DRAM Caches

Stacked DRAM promises to offer unprecedented capacity and bandwidth to multi-core processors at moderately lower latency than off-chip DRAM. A typical use of this abundant DRAM is as a large last-level cache. Prior work is divided on how to organize this cache, and the proposed organizations fall into one of two categories: (i) Tags-In-DRAM organizations, with the cache organized as small blocks (typically 64B) and metadata (tags, valid, dirty, recency, and coherence bits) stored in DRAM, and (ii) Tags-In-SRAM organizations, with the cache organized as larger blocks (typically 512B or larger) and metadata stored in SRAM. Tags-In-DRAM organizations tend to incur higher latency but conserve off-chip bandwidth, while Tags-In-SRAM organizations incur lower latency at the cost of some additional bandwidth. In this work, we develop a unified performance model of the DRAM cache that captures these different organizational styles. The model is validated against detailed architecture simulations and shown to have average latency estimation errors of 10.7% and 8.8% on 4-core and 8-core processors respectively. We also explore two insights from the model: (i) the need for very high hit rates in the metadata cache/predictor (commonly employed in Tags-In-DRAM designs) to reduce latency, and (ii) opportunities for reducing latency by load-balancing the DRAM cache and main memory.
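To illustrate why metadata-predictor hit rate matters so much in a Tags-In-DRAM design, the sketch below composes an average access latency from a few hypothetical parameters. This is not the paper's model; all parameter names and values are illustrative assumptions, chosen only to show that a predictor miss serializes an extra DRAM tag lookup in front of every access.

```python
def avg_access_latency(hit_rate, t_hit, t_miss_penalty,
                       metadata_hit_rate, t_metadata_miss):
    """Illustrative average-latency composition for a Tags-In-DRAM cache.

    All parameters are hypothetical placeholders, not the paper's model:
      hit_rate          -- DRAM-cache hit rate
      t_hit             -- latency of a DRAM-cache hit (cycles)
      t_miss_penalty    -- extra latency of off-chip memory on a miss
      metadata_hit_rate -- hit rate of the SRAM metadata cache/predictor
      t_metadata_miss   -- extra DRAM tag-lookup latency on a predictor miss
    """
    # A metadata-cache miss adds a serialized tag lookup in DRAM
    # before the data access itself can proceed.
    t_lookup = (1.0 - metadata_hit_rate) * t_metadata_miss
    return (t_lookup
            + hit_rate * t_hit
            + (1.0 - hit_rate) * (t_hit + t_miss_penalty))

# With these illustrative numbers, raising the predictor hit rate from
# 90% to 99% removes most of the serialized tag-lookup component.
lat_90 = avg_access_latency(0.7, 50, 100, 0.90, 40)
lat_99 = avg_access_latency(0.7, 50, 100, 0.99, 40)
```

Under these made-up numbers, the 90%-accurate predictor adds 4 cycles of tag-lookup latency to every access versus 0.4 cycles at 99%, which is why the model points to very high metadata hit rates as a prerequisite for low latency.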
