An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators

Deep Neural Networks (DNNs) have shown significant advantages in many domains, such as pattern recognition, prediction, and control optimization. The demand for edge computing in the Internet-of-Things (IoT) era has motivated many kinds of computing platforms to accelerate DNN operations. However, because of their massive parallel processing, the performance of current large-scale artificial neural networks is often limited by huge communication overheads and storage requirements. As a result, efficient interconnection and data-movement mechanisms for future on-chip artificial intelligence (AI) accelerators are worth studying, and a large body of current research aims to find efficient on-chip interconnections that achieve low-power, high-bandwidth DNN computing. This paper systematically investigates the interconnection networks in modern DNN accelerator designs, surveying recent advances in efficient on-chip interconnection and the associated design methodologies. First, we provide an overview of the different interconnection methods used in DNN accelerators. Then, we discuss interconnection methods in non-ASIC DNN accelerators. In addition, a flexible interconnection allows a DNN accelerator to support different computing flows, which increases its computing flexibility; with this motivation, we also investigate reconfigurable DNN computing with flexible on-chip interconnections. Finally, we examine emerging interconnection technologies (e.g., in/near-memory processing) for DNN accelerator design. After reading this article, readers will be able to: 1) understand interconnection design for DNN accelerators; 2) evaluate DNNs with different on-chip interconnections; and 3) become familiar with the trade-offs among different interconnection designs.
