论文信息 - On-Chip Memory Technology Design Space Explorations for Mobile Deep Neural Network Accelerators

On-Chip Memory Technology Design Space Explorations for Mobile Deep Neural Network Accelerators

Deep neural network (DNN) inference tasks have become ubiquitous workloads on mobile SoCs and demand energy-efficient hardware accelerators. Mobile DNN accelerators are heavily area-constrained, with only minimal on-chip SRAM, which results in heavy use of inefficient off-chip DRAM. With diminishing returns from conventional silicon technology scaling, emerging memory technologies that offer better area density than SRAM can boost accelerator efficiency by minimizing costly off-chip DRAM accesses. This paper presents a detailed design space exploration (DSE) of technology-system co-design for systolic-array accelerators. We focus on practical/mature on-chip memory technologies, including SRAM, eDRAM, MRAM, and 3D vertical RRAM (VRRAM). The DSE employs state-of-the-art optimizations (e.g., model compression and optimized buffer scheduling), and evaluates results on important models including ResNet-50, MobileNet, and Faster-RCNN. Compared to an SRAM/DRAM baseline, MRAM-based accelerators show up to 4.68× energy benefits (57% area overhead), while a 3D VRRAM-based design achieves 2.22 × energy benefits (33% area reduction).

[1] Gu-Yeon Wei,et al. DNN Engine: A 28-nm Timing-Error Tolerant Sparse Deep Neural Network Processor for IoT Applications , 2018, IEEE Journal of Solid-State Circuits.

[2] Seung H. Kang,et al. Systematic optimization of 1 Gbit perpendicular magnetic tunnel junction arrays for 28 nm embedded STT-MRAM and beyond , 2015, 2015 IEEE International Electron Devices Meeting (IEDM).

[3] Alexander Fish,et al. An 800-MHz Mixed- $V_{\text{T}}$ 4T IFGC Embedded DRAM in 28-nm CMOS Bulk Process for Approximate Storage Applications , 2018, IEEE Journal of Solid-State Circuits.

[4] G. Northrop,et al. High performance 14nm SOI FinFET CMOS technology with 0.0174µm2 embedded DRAM and 15 levels of Cu metallization , 2014, 2014 IEEE International Electron Devices Meeting.

[5] Matthew Mattina,et al. SCALE-Sim: Systolic CNN Accelerator , 2018, ArXiv.

[6] Jason Cong,et al. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks , 2015, FPGA.

[7] Yu Cao,et al. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks , 2016, FPGA.

[8] Alexander Fish,et al. Live Demonstration: An 800 Mhz Gain-Cell Embedded DRAM in 28 nm CMOS Bulk Process for Approximate Computing Applications , 2018, 2018 IEEE International Symposium on Circuits and Systems (ISCAS).

[9] H.-S. Philip Wong,et al. 14.3 A 43pJ/Cycle Non-Volatile Microcontroller with 4.7μs Shutdown/Wake-up Integrating 2.3-bit/Cell Resistive RAM and Resilience Techniques , 2019, 2019 IEEE International Solid- State Circuits Conference - (ISSCC).

[10] H-S Philip Wong,et al. Memory leads the way to better computing. , 2015, Nature nanotechnology.

[11] Jonathan Chang,et al. A 5GHz 7nm L1 cache memory compiler for high-speed computing and mobile applications , 2018, 2018 IEEE International Solid - State Circuits Conference - (ISSCC).

[12] Miao Hu,et al. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[13] Guo-Wei Huang,et al. First fully functionalized monolithic 3D+ IoT chip with 0.5 V light-electricity power management, 6.8 GHz wireless-communication VCO, and 4-layer vertical ReRAM , 2016, 2016 IEEE International Electron Devices Meeting (IEDM).

[14] David Blaauw,et al. A 1Mb 28nm STT-MRAM with 2.8ns read access time at 1.2V VDD using single-cap offset-cancelled sense amplifier and in-situ self-write-termination , 2018, 2018 IEEE International Solid - State Circuits Conference - (ISSCC).

[15] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[16] Christoforos E. Kozyrakis,et al. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory , 2017, ASPLOS.

[17] Tao Zhang,et al. PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[18] Joel Emer,et al. Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks , 2016, CARN.

[19] S. Natarajan,et al. A high-performance, high-density 28nm eDRAM technology with high-K/metal-gate , 2011, 2011 International Electron Devices Meeting.

[20] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[21] Song Han,et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).