On-Chip Memory Technology Design Space Explorations for Mobile Deep Neural Network Accelerators

Deep neural network (DNN) inference tasks have become ubiquitous workloads on mobile SoCs and demand energy-efficient hardware accelerators. Mobile DNN accelerators are heavily area-constrained and carry only minimal on-chip SRAM, so they rely heavily on energy-inefficient off-chip DRAM. With diminishing returns from conventional silicon technology scaling, emerging memory technologies that offer better area density than SRAM can boost accelerator efficiency by minimizing costly off-chip DRAM accesses. This paper presents a detailed design space exploration (DSE) of technology-system co-design for systolic-array accelerators. We focus on practical, mature on-chip memory technologies: SRAM, eDRAM, MRAM, and 3D vertical RRAM (VRRAM). The DSE employs state-of-the-art optimizations (e.g., model compression and optimized buffer scheduling) and is evaluated on important models including ResNet-50, MobileNet, and Faster-RCNN. Compared to an SRAM/DRAM baseline, MRAM-based accelerators show up to a 4.68× energy benefit (at a 57% area overhead), while a 3D VRRAM-based design achieves a 2.22× energy benefit (with a 33% area reduction).
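
The trade-off described above (denser on-chip memory fits more of the working set in the same area, so fewer accesses go to off-chip DRAM) can be illustrated with a toy model. The sketch below is not the paper's methodology: the linear miss model, the `MemoryTech` fields, the technology name `DenseMem`, and every numeric value are hypothetical placeholders chosen only to show the shape of such an exploration loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryTech:
    """Hypothetical on-chip memory technology; all values are placeholders."""
    name: str
    density_rel_sram: float   # capacity per unit area, relative to SRAM
    energy_per_access: float  # on-chip access energy, arbitrary units


# Placeholder off-chip access energy in the same arbitrary units (assumption).
DRAM_ENERGY_PER_ACCESS = 100.0


def dram_accesses(working_set: float, capacity: float, total_accesses: float) -> float:
    """Toy miss model: the share of the working set that does not fit on chip
    is served from DRAM. Assumes working_set > 0."""
    miss_fraction = max(0.0, 1.0 - capacity / working_set)
    return total_accesses * miss_fraction


def total_energy(tech: MemoryTech, area_budget: float,
                 working_set: float, total_accesses: float) -> float:
    """Energy = on-chip accesses * on-chip cost + off-chip accesses * DRAM cost."""
    capacity = area_budget * tech.density_rel_sram
    off_chip = dram_accesses(working_set, capacity, total_accesses)
    on_chip = total_accesses - off_chip
    return on_chip * tech.energy_per_access + off_chip * DRAM_ENERGY_PER_ACCESS


def explore(techs, area_budget, working_set, total_accesses, baseline="SRAM"):
    """Return each technology's energy benefit relative to the baseline (>1 is better)."""
    energies = {t.name: total_energy(t, area_budget, working_set, total_accesses)
                for t in techs}
    base = energies[baseline]
    return {name: base / energy for name, energy in energies.items()}


if __name__ == "__main__":
    candidates = [
        MemoryTech("SRAM", density_rel_sram=1.0, energy_per_access=1.0),
        MemoryTech("DenseMem", density_rel_sram=4.0, energy_per_access=2.0),
    ]
    for name, benefit in explore(candidates, area_budget=1.0,
                                 working_set=4.0, total_accesses=1000.0).items():
        print(f"{name}: {benefit:.2f}x energy benefit vs. SRAM baseline")
```

Even in this simplified form, a denser technology with a higher per-access cost can still come out ahead once it removes enough DRAM traffic. A real exploration would replace the placeholder miss model with simulated buffer behavior and measured per-technology parameters.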
