Reconfigurability, Why It Matters in AI Tasks Processing: A Survey of Reconfigurable AI Chips

Nowadays, artificial intelligence (AI) technologies, especially deep neural networks (DNNs), play a vital role in solving many problems in both academia and industry. To simultaneously meet the demands of performance, energy efficiency, and flexibility in DNN processing, various reconfigurable AI chips have been proposed in the past several years. Built on FPGA or CGRA platforms, they offer domain-specific reconfigurability that customizes the computing units and data paths for different DNN tasks without re-fabricating the chips. This paper surveys representative reconfigurable AI chips at three reconfiguration hierarchies: the processing element level, the processing element array level, and the chip level. Each hierarchy covers a set of important optimization techniques for DNN computation that are frequently adopted in practice. The paper lists reconfigurable AI chip works in chronological order, discusses the hardware development process behind each optimization technique, and analyzes why reconfigurability is necessary for AI task processing. Trends within each reconfiguration hierarchy and insights into how techniques from different hierarchies can cooperate are also presented.
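To make processing-element-level reconfigurability concrete, the sketch below models a multiply-accumulate (MAC) unit whose weight bit-width is a run-time configuration parameter, so the same hardware can serve 8-bit, 4-bit, or binary-weight layers without re-fabricating the chip. This is a minimal behavioral illustration in the spirit of bit-serial, precision-scalable designs such as Stripes and UNPU; the class, the method names, and the unsigned integer format are assumptions made for illustration, not the design of any surveyed chip.

```python
# Minimal behavioral sketch of a PE-level reconfigurable MAC unit.
# The weight bit-width is reconfigured at run time, so latency and energy
# scale with the configured precision (as in bit-serial designs).
# All names and the unsigned-weight simplification are illustrative assumptions.

class ReconfigurableMAC:
    def __init__(self, weight_bits: int = 8):
        self.configure(weight_bits)

    def configure(self, weight_bits: int) -> None:
        """Reconfigure the PE for a new weight precision (1 to 8 bits here)."""
        assert 1 <= weight_bits <= 8
        self.weight_bits = weight_bits

    def mac(self, activation: int, weight: int, accumulator: int) -> int:
        """Bit-serial multiply-accumulate: one shift-add per configured weight bit."""
        product = 0
        w = weight & ((1 << self.weight_bits) - 1)  # keep only the configured bits
        for bit in range(self.weight_bits):
            if (w >> bit) & 1:
                product += activation << bit
        return accumulator + product


if __name__ == "__main__":
    pe = ReconfigurableMAC(weight_bits=8)
    acc = pe.mac(activation=3, weight=10, accumulator=0)    # 8-bit-weight layer
    pe.configure(weight_bits=2)                              # switch to a 2-bit-weight layer
    acc = pe.mac(activation=3, weight=0b11, accumulator=acc)
    print(acc)  # 30 + 9 = 39
```

Precision scaling of this kind is one commonly adopted PE-level optimization; array-level and chip-level reconfiguration instead target dataflow mapping and task partitioning across larger compute resources.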
