Toward Efficient Programmer-Managed Two-Level Memory Hierarchies in Exascale Computers

Future exascale systems will require aggressive memory systems that simultaneously deliver huge storage capacities and multi-TB/s bandwidths. To achieve the bandwidth targets, in-package, die-stacked memory technologies will likely be necessary. However, these integrated memories do not provide enough capacity to meet the overall per-node memory size requirements. As a result, conventional off-package memory (e.g., DIMMs) will still be needed. This creates a "two-level memory" (TLM) organization in which a portion of the machine's memory space provides high bandwidth, and the remainder provides capacity at a lower level of performance. Effective use of such a heterogeneous memory organization may require co-design of the software applications along with the advancements in memory architecture. In this paper, we explore the efficacy of programmer-driven approaches to managing a TLM system, using three exascale proxy applications as case studies.
