Parana: A Parallel Neural Architecture Considering Thermal Problem of 3D Stacked Memory

Recent advances in deep learning (DL) have stimulated increasing interest in neural networks (NNs). From the perspective of operation type and network architecture, deep neural networks can be categorized into convolution-based neural networks (ConvNets), recurrent neural networks (RNNs), and fully-connected neural networks (FCNets). Different types of neural networks are usually cascaded and combined into a hybrid neural network (hybrid-NN) to complete real-life cognitive tasks. Such hybrid-NN implementations are memory-intensive, involving a large number of memory accesses, so hybrid-NN performance is often limited by insufficient memory bandwidth. A "3D + 2.5D" integration system, which places a high-bandwidth 3D stacked DRAM side-by-side with a highly parallel neural processing unit (NPU) on a silicon interposer, overcomes the bandwidth bottleneck in hybrid-NN acceleration. However, the intensive concurrent 3D DRAM accesses issued by the NPU lead to a serious thermal problem in the 3D DRAM. In this paper, we propose a neural processor called Parana for hybrid-NN acceleration that takes the thermal problem of 3D DRAM into account. Parana addresses the thermal problem of 3D memory by optimizing both the total number of memory accesses and the memory access behavior. To shape memory access behavior, Parana balances memory bandwidth by spatial-division mapping of the hybrid-NN onto the computing resources, which prevents a flood of memory accesses from being issued within a short time period. To reduce the total number of memory accesses, we design a new NPU architecture and propose a memory-oriented tiling and scheduling mechanism that maximizes the utilization of the on-chip buffer. Experimental results show that Parana reduces the peak temperature by up to 54.72 °C and the steady-state temperature by up to 32.27 °C compared with state-of-the-art accelerators with 3D memory, without performance degradation.
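To make the memory-oriented tiling idea concrete, the sketch below shows how tile sizes for a single convolutional layer might be chosen so that the working set fits the on-chip buffer while estimated off-chip traffic is minimized. This is an illustration under stated assumptions, not Parana's actual mechanism: the buffer capacity, data width, the first-order DRAM-traffic model, and all names (footprint, dram_bytes, best_tiling) are hypothetical.

import math
from itertools import product

BUFFER_BYTES = 288 * 1024   # assumed on-chip buffer capacity (not from the paper)
ELEM_BYTES = 2              # assumed 16-bit data

def footprint(th, tw, tci, tco, kh, kw):
    """On-chip bytes for one tile: input halo, weight slice, partial outputs."""
    inp = (th + kh - 1) * (tw + kw - 1) * tci
    wgt = kh * kw * tci * tco
    out = th * tw * tco
    return (inp + wgt + out) * ELEM_BYTES

def dram_bytes(H, W, Ci, Co, kh, kw, th, tw, tci, tco):
    """First-order traffic model: inputs are refetched once per output-channel
    tile, weights once per spatial tile, and outputs are written once
    (partial sums accumulate on-chip across input-channel tiles)."""
    n_sp = math.ceil(H / th) * math.ceil(W / tw)
    n_co = math.ceil(Co / tco)
    inp = H * W * Ci * n_co
    wgt = kh * kw * Ci * Co * n_sp
    out = H * W * Co
    return (inp + wgt + out) * ELEM_BYTES

def divisors(n):
    """Candidate tile sizes: divisors of n keep the search space small."""
    return [d for d in range(1, n + 1) if n % d == 0]

def best_tiling(H, W, Ci, Co, kh, kw):
    """Exhaustively pick the buffer-fitting tiling with minimal DRAM traffic."""
    best = None
    for th, tw, tci, tco in product(divisors(H), divisors(W),
                                    divisors(Ci), divisors(Co)):
        if footprint(th, tw, tci, tco, kh, kw) > BUFFER_BYTES:
            continue
        cost = dram_bytes(H, W, Ci, Co, kh, kw, th, tw, tci, tco)
        if best is None or cost < best[0]:
            best = (cost, (th, tw, tci, tco))
    return best

# Example: a VGG-style 3x3 layer, 56x56 feature maps, 128 -> 128 channels.
cost, (th, tw, tci, tco) = best_tiling(56, 56, 128, 128, 3, 3)
print(f"tile {th}x{tw}, {tci} in-ch, {tco} out-ch: ~{cost / 1e6:.1f} MB of DRAM traffic")

Larger tiles reuse more data on-chip and so cut total DRAM traffic, which is exactly the lever the abstract describes; a full scheduler would additionally spread the remaining accesses over time to avoid thermal hot spots.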
