Architecting On-Chip DRAM Cache for Simultaneous Miss Rate and Latency Reduction

On-chip dynamic random access memory (DRAM) caches have recently been employed in the memory hierarchy to mitigate the widening latency gap between high-speed cores and off-chip memory. Two important parameters are the DRAM cache miss rate (D$-MR) and the DRAM cache hit latency (D$-HL), as they strongly influence performance. Both parameters depend on the DRAM set mapping policy. Recently proposed DRAM set mapping policies are predominantly optimized for either D$-MR or D$-HL, but not both. We propose novel DRAM set mapping policies that simultaneously reduce D$-MR (via high associativity) and D$-HL (via improved row buffer hit rates). To further reduce D$-HL, we propose a small, low-latency DRAM Tag cache (DTC) structure that can quickly determine whether an access to the DRAM cache will be a hit or a miss. Because the benefit of the DTC depends on its hit rate, we also present a novel adaptive DTC insertion policy that increases the DTC hit rate. We investigate the latency and miss rate tradeoffs in designing a DRAM cache hierarchy and analyze the effects of different policies on overall performance. We evaluate our policies on a wide variety of workloads and compare their performance with three recent proposals for on-chip DRAM caches. For a 16-core system, our set mapping policy, together with our DTC and its adaptive insertion policy, improves the harmonic mean instructions-per-cycle throughput by 25.4%, 15.5%, and 7.3% over the three state-of-the-art proposals, respectively, while requiring 55% less storage overhead for DRAM cache hit/miss prediction.
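To make the role of a tag cache concrete, the following is a minimal sketch under stated assumptions, not the design proposed in the paper. It assumes that each entry holds all tags of one DRAM cache set, so a lookup that finds the entry can answer both "hit" and "miss" without reading tags from DRAM. It also assumes plain LRU replacement in place of the adaptive insertion policy, which the abstract does not specify. The names `DramTagCache`, `lookup`, and `insert` are hypothetical.

```python
from collections import OrderedDict


class DramTagCache:
    """Toy model of a small tag cache in front of a DRAM cache (illustrative only)."""

    def __init__(self, capacity):
        # Assumption: capacity counts DRAM-cache sets whose tags fit in the structure.
        self.capacity = capacity
        # Maps set index -> set of tags, kept in LRU order (oldest first).
        self.entries = OrderedDict()

    def lookup(self, set_index, tag):
        """Return True or False for a known DRAM-cache hit or miss, None on a tag-cache miss."""
        tags = self.entries.get(set_index)
        if tags is None:
            return None  # outcome unknown: tags would have to be read from DRAM
        self.entries.move_to_end(set_index)  # mark as most recently used
        return tag in tags

    def insert(self, set_index, tags):
        """Install the tags of one DRAM-cache set, evicting the LRU entry when full."""
        self.entries[set_index] = set(tags)
        self.entries.move_to_end(set_index)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used set


dtc = DramTagCache(capacity=2)
dtc.insert(0, {0x1A, 0x2B})
print(dtc.lookup(0, 0x1A))  # tag present in a tracked set
print(dtc.lookup(0, 0x99))  # tag absent from a tracked set
print(dtc.lookup(5, 0x1A))  # set not tracked, outcome unknown
```

In this toy model, only the third lookup would need to consult the tags stored in DRAM, which is why the hit rate of such a structure matters for hit latency.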
