Memory-centric system interconnect design with hybrid memory cubes

Memory bandwidth has been one of the most critical system performance bottlenecks. As a result, the HMC (Hybrid Memory Cube) has recently been proposed to improve DRAM bandwidth as well as energy efficiency. In this paper, we explore different system interconnect designs with HMCs. We show that processor-centric network architectures cannot fully utilize processor bandwidth across different traffic patterns. Thus, we propose a memory-centric network in which all processor channels are connected to HMCs and not to any other processors as all communication between processors goes through intermediate HMCs. Since there are multiple HMCs per processor, we propose a distributor-based network to reduce the network diameter and achieve lower latency while properly distributing the bandwidth across different routers and providing path diversity. Memory-centric networks lead to some challenges including higher processor-to-processor latency and the need to properly exploit the path diversity. We propose a pass-through microarchitecture, which, in combination with the proper intra-HMC organization, reduces the zero-load latency while exploiting adaptive (and non-minimal) routing to load-balance across different channels. Our results show that memory-centric networks can efficiently utilize processor bandwidth for different traffic patterns and achieve higher performance by providing higher memory bandwidth and lower latency.

[1]  Yuan Xie,et al.  Energy-efficient GPU design with reconfigurable in-package graphics memory , 2012, ISLPED '12.

[2]  William J. Dally,et al.  Flattened butterfly: a cost-efficient topology for high-radix networks , 2007, ISCA '07.

[3]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[4]  Richard W. Vuduc,et al.  Many-Thread Aware Prefetching Mechanisms for GPGPU Applications , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[5]  William J. Dally,et al.  Energy-efficient mechanisms for managing thread context in throughput processors , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[6]  Robert J. Safranek,et al.  Intel® QuickPath Interconnect Architectural Features Supporting Scalable System Architectures , 2010, 2010 18th IEEE Symposium on High Performance Interconnects.

[7]  William J. Dally,et al.  Cost-Efficient Dragonfly Topology for Large-Scale Systems , 2009, IEEE Micro.

[8]  John R. Feehrer,et al.  The Oracle Sparc T5 16-Core Processor Scales to Eight Sockets , 2013, IEEE Micro.

[9]  Goichi Ono,et al.  A 12.3mW 12.5Gb/s complete transceiver in 65nm CMOS , 2010, 2010 IEEE International Solid-State Circuits Conference - (ISSCC).

[10]  Nan Jiang,et al.  A detailed and flexible cycle-accurate Network-on-Chip simulator , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[11]  Hyesoon Kim,et al.  An integrated GPU power and performance model , 2010, ISCA.

[12]  Vivek De,et al.  Life is CMOS: why chase the life after? , 2002, DAC '02.

[13]  Marcelo Cintra,et al.  Stream chaining: exploiting multiple levels of correlation in data prefetching , 2009, ISCA '09.

[14]  Jose-Maria Arnau,et al.  Boosting mobile GPU performance with a decoupled access/execute fragment processor , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[15]  Jung Ho Ahn,et al.  CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[16]  G. Edward Suh,et al.  SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi-threading , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[17]  Andreas Moshovos,et al.  Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[18]  Bruce Jacob,et al.  Buffer-on-board memory systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[19]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[20]  Bruce Jacob,et al.  The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't Fake It , 2009, The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't Fake It.

[21]  Janak H. Patel,et al.  Stride directed prefetching in scalar processors , 1992, MICRO 1992.

[22]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[23]  O Seongil,et al.  McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[24]  Tao Tang,et al.  Power Optimization for GPU Programs Based on Software Prefetching , 2011, 2011IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications.

[25]  Sriram R. Vangal,et al.  A 2 Tb/s 6$\,\times\,$ 4 Mesh Network for a Single-Chip Cloud Computer With DVFS in 45 nm CMOS , 2011, IEEE Journal of Solid-State Circuits.

[26]  Joseph Antony,et al.  Exploring Thread and Memory Placement on NUMA Architectures: Solaris and Linux, UltraSPARC/FirePlane and Opteron/HyperTransport , 2006, HiPC.

[27]  William J. Dally,et al.  Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.

[28]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[29]  Jung Ho Ahn,et al.  The McPAT Framework for Multicore and Manycore Architectures: Simultaneously Modeling Power, Area, and Timing , 2013, TACO.

[30]  Homan Igehy,et al.  Prefetching in a texture cache architecture , 1998, Workshop on Graphics Hardware.

[31]  M. Horowitz,et al.  A 14-mW 6.25-Gb/s Transceiver in 90-nm CMOS , 2007, IEEE Journal of Solid-State Circuits.

[32]  Kevin Skadron,et al.  Increasing memory miss tolerance for SIMD cores , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[33]  William J. Dally,et al.  The BlackWidow High-Radix Clos Network , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[34]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[35]  Onur Mutlu,et al.  Coordinated control of multiple prefetchers in multi-core systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[36]  Aamer Jaleel,et al.  Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[37]  Scott A. Mahlke,et al.  PEPSC: A Power-Efficient Processor for Scientific Computing , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[38]  Jason Cong,et al.  Utilizing Radio-Frequency Interconnect for a Many-DIMM DRAM System , 2012, IEEE Journal on Emerging and Selected Topics in Circuits and Systems.

[39]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[40]  Tom R. Halfhill NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .

[41]  Norman P. Jouppi,et al.  Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[42]  J. Thomas Pawlowski,et al.  Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[43]  A. Shubat,et al.  Mappable peripheral memory for high speed applications , 1989, Proceedings. VLSI and Computer Peripherals. COMPEURO 89.

[44]  Norman P. Jouppi,et al.  CACTI 2.0: An Integrated Cache Timing and Power Model , 2002 .

[45]  Onur Mutlu,et al.  Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[46]  Mark Horowitz,et al.  Energy-Efficient Floating-Point Unit Design , 2011, IEEE Transactions on Computers.

[47]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[48]  Jean-Loup Baer,et al.  Effective Hardware Based Data Prefetching for High-Performance Processors , 1995, IEEE Trans. Computers.

[49]  Sally A. McKee,et al.  Hitting the memory wall: implications of the obvious , 1995, CARN.

[50]  Brian Rogers,et al.  Scaling the bandwidth wall: challenges in and avenues for CMP scaling , 2009, ISCA '09.

[51]  Mike Higgins,et al.  Cray Cascade: A scalable HPC system based on a Dragonfly network , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[52]  Larry Kaplan,et al.  The Gemini System Interconnect , 2010, 2010 18th IEEE Symposium on High Performance Interconnects.