There and back again: Optimizing the interconnect in networks of memory cubes

High-performance computing, enterprise, and datacenter servers are driving demand for higher total memory capacity as well as higher memory performance. Memory “cubes” with high per-package capacity (from 3D integration), combined with high-speed point-to-point interconnects, provide a scalable memory system architecture with the potential to deliver both capacity and performance. Multiple such cubes connected together form a “Memory Network” (MN), but the design space for such MNs is vast, spanning multiple topology types and multiple memory technologies per memory cube. In this work, we first analyze several MN topologies with different mixes of memory package technologies to understand the key tradeoffs and bottlenecks in such systems. We find that most of an MN's performance challenges arise from the interconnection network that binds the memory cubes together. In particular, the arbitration schemes used to route requests through the MN, the ratio of NVM to DRAM, and the specific topology used all have a dramatic impact on performance and energy. Our initial analysis indicates that introducing non-volatile memory to the MN presents a unique tradeoff between memory array latency and network latency. We observe that placing NVM cubes in a specific order in the MN improves performance by reducing the network size/diameter, up to a certain NVM-to-DRAM ratio. Novel MN topologies and arbitration schemes also deliver performance and energy gains by reducing the hop count of requests and responses in the MN. Based on our analyses, we introduce three techniques to address MN latency: (1) a distance-based arbitration scheme that improves queuing latencies throughout the network, (2) a skip-list topology, derived from the classic data structure, that improves network latency and link usage, and (3) the MetaCube, a denser memory cube that leverages advanced packaging technologies to improve latency by reducing the size of the MN.
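
To make the skip-list topology idea concrete, the sketch below is a minimal illustration only, not the paper's implementation: it assumes a deterministic skip-list-style rule in which a level-k express link spans 2^k cubes on top of a daisy chain, then compares shortest-path hop counts against the plain chain. The cube count, the level rule, and the function names (skip_list_links, hop_count) are hypothetical.

```python
# Minimal, hypothetical sketch (not from the paper): build skip-list-style
# express links over a daisy chain of memory cubes and compare hop counts.
from collections import deque


def skip_list_links(num_cubes, max_level):
    """Undirected links over cubes 0..num_cubes-1.

    Level 0 is the ordinary daisy chain; level k adds an express link
    every 2**k cubes, mirroring the levels of a skip list.
    """
    links = set()
    for level in range(max_level + 1):
        stride = 2 ** level
        for src in range(0, num_cubes - stride, stride):
            links.add((src, src + stride))
    return links


def hop_count(links, num_cubes, src, dst):
    """Shortest-path hop count between two cubes (unweighted BFS)."""
    adj = {c: [] for c in range(num_cubes)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            return dist[cur]
        for nxt in adj[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return -1  # unreachable; cannot happen for a connected chain


if __name__ == "__main__":
    n = 16                                    # hypothetical 16-cube chain
    chain = skip_list_links(n, max_level=0)   # plain daisy chain
    skip = skip_list_links(n, max_level=3)    # chain plus express levels 1..3
    print("daisy-chain hops 0 -> 15:", hop_count(chain, n, 0, 15))  # 15
    print("skip-list hops  0 -> 15:", hop_count(skip, n, 0, 15))    # 4
```

As in the skip-list data structure, the higher levels let distant requests bypass most intermediate cubes, which is the hop-count reduction the abstract attributes to the skip-list topology.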
