Walter: Wide I/O Scaling of Number of Memory Controllers Versus Frequency and Voltage

Computational applications drive growth in the number of cores, which in turn increases the demand for memory bandwidth. Using wider ranks and/or scaling the number of memory controllers (MCs) is a straightforward way to increase memory bandwidth. Connecting wide ranks and MCs via low-capacitance through-silicon vias (TSVs) favors high-bandwidth 3D-stacked systems (e.g., Wide I/O). Given that voltage and frequency scaling (VFS) lowers power consumption but lower clock frequencies reduce bandwidth, this article proposes Walter, a Wide I/O technique that trades off scaling of the number of MCs against clock frequency and voltage scaling to mitigate the bandwidth loss and improve energy per bit. Our findings show that Walter's Wide I/O architectural benefits of using a larger number of MCs coupled with wider ranks, when combined with VFS, are promising: compared to the baseline, for a 75% frequency/voltage reduction, MC scaling improved memory bandwidth by 2.4x and reduced energy per bit by 20% (for most benchmarks, with up to 16 MCs). Walter's architectural replacement of ranks set at specification frequencies with ranks set at lower frequencies reduces temperature, which likely allows further rank stacking.
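The trade-off between MC count and clock frequency can be illustrated with a minimal sketch of idealized peak bandwidth. This is not the article's evaluation model: the function names, the single-data-rate default, and the example configuration (4 MCs, 128-bit ranks, 200 MHz) are assumptions chosen for illustration, and the sketch ignores contention, refresh, and other effects that determine measured bandwidth.

```python
import math


def peak_bandwidth_gbps(num_mcs, rank_width_bits, freq_mhz, transfers_per_cycle=1):
    """Idealized peak bandwidth in GB/s (hypothetical helper, not from the article).

    Assumes every MC drives one rank of the given width and transfers data
    on every clock cycle.
    """
    bits_per_second = num_mcs * rank_width_bits * freq_mhz * 1e6 * transfers_per_cycle
    return bits_per_second / 8 / 1e9


def mcs_to_match(baseline_mcs, freq_scale):
    """Smallest MC count whose idealized peak bandwidth at the scaled
    frequency is at least the baseline's (hypothetical helper).

    freq_scale is the fraction of the baseline frequency that remains,
    e.g. 0.25 if the clock is cut to one quarter.
    """
    return math.ceil(baseline_mcs / freq_scale)


# Illustrative configuration (assumed values, not results from the article).
baseline = peak_bandwidth_gbps(num_mcs=4, rank_width_bits=128, freq_mhz=200)
scaled = peak_bandwidth_gbps(num_mcs=16, rank_width_bits=128, freq_mhz=50)
print(baseline, scaled)
print(mcs_to_match(baseline_mcs=4, freq_scale=0.25))
```

Under these assumptions, quartering the clock while quadrupling the MC count leaves idealized peak bandwidth unchanged. Gains in measured bandwidth, such as those reported above, depend on workload behavior that this sketch does not capture.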
