TicToc: Enabling Bandwidth-Efficient DRAM Caching for Both Hits and Misses in Hybrid Memory Systems

This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. 3D-XPoint is becoming a viable alternative to DRAM as it enables high-capacity and non-volatile main memory systems. However, 3D-XPoint has several characteristics that prevent it from outright replacing DRAM: its reads are 4-8x slower, and its writes are even worse. As such, effective DRAM caching in front of 3D-XPoint is important to enable a high-capacity, low-latency, and high-write-bandwidth memory. There are currently two major approaches to DRAM cache design: (1) a Tag-Inside-Cacheline (TIC) organization that optimizes for hits by storing the tag next to each line, such that one access gets both tag and data, and (2) a Tag-Outside-Cacheline (TOC) organization that optimizes for misses by storing the tags of multiple data lines together in a tag-line, such that one access to a tag-line gets information on several data lines. Ideally, we would like to have the low hit latency of TIC designs and the low miss bandwidth of TOC designs. To this end, we propose a TicToc organization that provisions both TIC and TOC to get the hit and miss benefits of both. We find that naively combining the two techniques actually performs worse than TIC alone, because one has to pay the bandwidth cost of maintaining both sets of metadata. The main contribution of this work is developing architectural techniques to reduce the bandwidth cost of accessing and maintaining both TIC and TOC metadata. We find that most of the update bandwidth is due to maintaining the TOC dirty information. We propose a DRAM Cache Dirtiness Bit technique that carries DRAM cache dirty information to the last-level caches, to help prune repeated dirty-bit updates for lines already known to be dirty. We also propose a Preemptive Dirty Marking (PDM) technique that predicts which lines will be written and proactively marks the dirty bit at install time, to help avoid the initial dirty-bit update for dirty lines. To support PDM, we develop a novel PC-based Write-Predictor to aid in marking only write-likely lines. Our evaluations on a 4GB DRAM cache in front of 3D-XPoint show that our TicToc organization enables a 10% speedup over the baseline TIC, approaching the 14% speedup possible with an idealized DRAM cache design with 64MB of SRAM tags, while needing only 34KB of SRAM.
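
The interaction between the dirtiness bit, PDM, and the write predictor can be illustrated with a toy model. The following is a minimal sketch and not the paper's implementation: the abstract does not describe the predictor's structure, so the table of 2-bit saturating counters, the PC hash, the threshold, and all class and function names (`WritePredictor`, `DramCacheModel`, `run`) are assumptions made for illustration. The model only counts how many TOC dirty-bit updates a simple install/write/evict pattern would trigger.

```python
class WritePredictor:
    """Hypothetical PC-indexed table of 2-bit saturating counters (assumed design)."""

    def __init__(self, entries=256, threshold=2):
        self.counters = [0] * entries
        self.threshold = threshold

    def _index(self, pc):
        # Assumed hash: fold the upper PC bits onto the lower ones.
        return (pc ^ (pc >> 8)) % len(self.counters)

    def predict_write(self, pc):
        return self.counters[self._index(pc)] >= self.threshold

    def train(self, pc, was_written):
        # Strengthen on lines that were written, weaken on lines that were not.
        i = self._index(pc)
        step = 1 if was_written else -1
        self.counters[i] = max(0, min(3, self.counters[i] + step))


class DramCacheModel:
    """Toy model that counts TOC dirty-bit updates; not a timing simulator."""

    def __init__(self, use_pdm, use_dirtiness_bit):
        self.predictor = WritePredictor()
        self.use_pdm = use_pdm
        self.use_dirtiness_bit = use_dirtiness_bit
        self.toc_dirty = {}    # line address -> dirty bit held in TOC metadata
        self.written = {}      # line address -> was the line actually written?
        self.install_pc = {}   # line address -> PC that triggered the install
        self.toc_updates = 0   # extra accesses spent updating TOC dirty bits

    def install(self, addr, pc):
        # PDM: mark the line dirty at install time if a write is predicted.
        self.toc_dirty[addr] = self.use_pdm and self.predictor.predict_write(pc)
        self.written[addr] = False
        self.install_pc[addr] = pc

    def writeback(self, addr):
        # A dirty line arrives from the last-level cache.
        self.written[addr] = True
        if self.use_dirtiness_bit and self.toc_dirty[addr]:
            return  # line is known to be dirty already, so the update is pruned
        self.toc_updates += 1
        self.toc_dirty[addr] = True

    def evict(self, addr):
        # Train the predictor with the observed outcome for the installing PC.
        self.predictor.train(self.install_pc.pop(addr), self.written.pop(addr))
        del self.toc_dirty[addr]


def run(use_pdm, use_dirtiness_bit, lines=10, write_pc=0x400):
    """Install `lines` lines from one PC, write each back twice, then evict."""
    model = DramCacheModel(use_pdm, use_dirtiness_bit)
    for addr in range(lines):
        model.install(addr, write_pc)
        model.writeback(addr)
        model.writeback(addr)
        model.evict(addr)
    return model.toc_updates
```

In this toy pattern, `run(False, False)` performs 20 dirty-bit updates, `run(False, True)` performs 10 because the second writeback to each line is pruned, and `run(True, True)` performs 2 because only the first two lines are installed before the assumed predictor has warmed up. The sketch does not model the cost of a misprediction; a line that is preemptively marked dirty but never written would presumably incur an unnecessary writeback, which is why the abstract stresses marking only write-likely lines.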
