Paper Information - Architecting On-Chip DRAM Cache for Simultaneous Miss Rate and Latency Reduction

Architecting On-Chip DRAM Cache for Simultaneous Miss Rate and Latency Reduction

On-chip dynamic random access memory (DRAM) caches have recently been employed in the memory hierarchy to mitigate the widening latency gap between high-speed cores and off-chip memory. Two important parameters are the DRAM cache miss rate (D$-MR) and the DRAM cache hit latency (D$-HL), as they strongly influence performance. Both parameters depend on the DRAM set mapping policy. Recently proposed DRAM set mapping policies are predominantly optimized for either D$-MR or D$-HL. We propose novel DRAM set mapping policies that simultaneously reduce D$-MR (via high associativity) and D$-HL (via improved row buffer hit rates). To further improve D$-HL, we propose a small, low-latency DRAM Tag cache (DTC) structure that can quickly determine whether an access to the DRAM cache will be a hit or a miss. The performance of the proposed DTC depends on the DTC hit rate. To increase it, we present a novel DTC insertion policy. We investigate the latency and miss rate tradeoffs in designing a DRAM cache hierarchy and analyze the effects of different policies on overall performance. We evaluate our policies on a wide variety of workloads and compare their performance with three recent proposals for on-chip DRAM caches. For a 16-core system, our set mapping policy, together with our DTC and its adaptive DTC insertion policy, improves the harmonic mean instructions-per-cycle throughput by 25.4%, 15.5%, and 7.3% compared to the state-of-the-art, while requiring 55% less storage overhead for DRAM cache hit/miss prediction.
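The abstract's remark that DTC performance depends on the DTC hit rate can be illustrated with a simple average-latency model. This is only a sketch under stated assumptions, not the paper's evaluation model: the function name, the latency values, and the assumption that a DTC miss pays the DTC lookup plus a tag read from the DRAM cache are all hypothetical.

```python
def expected_lookup_latency(dtc_hit_rate, dtc_latency, dram_tag_latency):
    """Average cycles needed to decide hit/miss for one DRAM cache access.

    Assumed model (not taken from the paper): every access probes the DTC;
    on a DTC miss the tags must additionally be read from the DRAM cache.
    """
    if not 0.0 <= dtc_hit_rate <= 1.0:
        raise ValueError("dtc_hit_rate must be in [0, 1]")
    # DTC probe is always paid; the DRAM tag read is paid only on a DTC miss.
    return dtc_latency + (1.0 - dtc_hit_rate) * dram_tag_latency


if __name__ == "__main__":
    # Hypothetical latencies in cycles, chosen only for illustration.
    for rate in (0.0, 0.5, 0.9, 1.0):
        cycles = expected_lookup_latency(rate, dtc_latency=2, dram_tag_latency=40)
        print(f"DTC hit rate {rate:.1f}: {cycles:.1f} cycles on average")
```

Under this model the average lookup cost falls linearly as the DTC hit rate rises, which is why an insertion policy that raises the hit rate would be expected to lower D$-HL.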

Jörg Henkel | Lars Bauer | Fazal Hameed
