A Software-Managed Approach to Die-Stacked DRAM

Advances in die-stacking (3D) technology have enabled the tight integration of significant quantities of DRAM with high-performance computation logic. How to integrate this technology into the overall architecture of a computing system is an open question. While much recent effort has focused on hardware-based techniques for using die-stacked memory (e.g., caching), in this paper we explore what it takes for a software-driven approach to be effective. First, we consider exposing die-stacked DRAM directly to applications, relying on static partitioning of allocations between fast on-chip and slow off-chip DRAM. We see only marginal benefits from this approach (9% speedup). Next, we explore OS-based page caches that dynamically partition application memory, but we find such approaches to be worse than not having stacked DRAM at all! We analyze the performance bottlenecks in OS page caches and propose two simple techniques that make the OS approach viable. The first is a hardware-assisted TLB shoot-down, a general mechanism that is valuable beyond stacked DRAM and enables OS-managed page caches to achieve a 27% speedup. The second is a software-implemented prefetcher that extends classic hardware prefetching algorithms to the page level, leading to a 39% speedup. With these simple and lightweight components, the OS page cache can provide 70% of the performance benefit that would be achievable with an ideal and unrealistic system in which all of main memory is die-stacked. However, we also find that applications with poor locality (e.g., graph analyses) are not amenable to any page-caching scheme, whether hardware- or software-managed, and we therefore recommend that the system still provide APIs that let applications explicitly control die-stacked DRAM allocations.
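
The "software-implemented prefetcher" is described here only at a high level. As a rough illustration of what extending a classic hardware prefetching algorithm to the page level could look like, the C sketch below trains a simple stride detector on page-cache miss addresses and asks the OS to migrate the next few pages along the detected stride into stacked DRAM. The migrate_to_stacked() hook, page size, prefetch degree, and confidence threshold are all assumptions made for illustration, not details taken from the paper.

/*
 * Minimal sketch of a page-granularity stride prefetcher: a classic
 * hardware stride-prefetching algorithm lifted to the OS page level.
 * The migrate_to_stacked() hook and the tuning constants below are
 * illustrative assumptions, not the paper's implementation.
 */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT      12   /* assume 4 KB pages */
#define PREFETCH_DEGREE  4   /* pages migrated ahead per trigger */

struct stride_state {
    uint64_t last_page;   /* page number of the previous cache miss */
    int64_t  stride;      /* last observed inter-miss stride (in pages) */
    int      confidence;  /* consecutive repetitions of that stride */
};

/* Stub standing in for a hypothetical OS hook that would asynchronously
 * copy an off-chip page into the die-stacked DRAM page cache. */
static void migrate_to_stacked(uint64_t page_number)
{
    printf("prefetch page %llu into stacked DRAM\n",
           (unsigned long long)page_number);
}

/* Invoked on every miss in the stacked-DRAM page cache. */
static void on_page_miss(struct stride_state *s, uint64_t fault_addr)
{
    uint64_t page   = fault_addr >> PAGE_SHIFT;
    int64_t  stride = (int64_t)(page - s->last_page);

    if (stride != 0 && stride == s->stride) {
        s->confidence++;          /* same stride again: grow confidence */
    } else {
        s->stride     = stride;   /* new pattern: retrain */
        s->confidence = 0;
    }

    /* Once the stride has repeated, pull the next few pages along it. */
    if (s->confidence >= 1) {
        for (int i = 1; i <= PREFETCH_DEGREE; i++)
            migrate_to_stacked(page + (uint64_t)(i * s->stride));
    }
    s->last_page = page;
}

int main(void)
{
    struct stride_state s = {0};
    /* A strided miss sequence: every other 4 KB page of an array. */
    for (uint64_t addr = 0x100000; addr < 0x100000 + 16 * 8192; addr += 8192)
        on_page_miss(&s, addr);
    return 0;
}

In a real OS page cache, the hook would presumably queue an asynchronous page migration (and the corresponding page-table and TLB updates) rather than printing, so that prefetch traffic overlaps with application execution.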
