SPCIM: Sparsity-Balanced Practical CIM Accelerator With Optimized Spatial-Temporal Multi-Macro Utilization

Compute-in-memory (CIM) is a promising technique that reduces data movement in neural network (NN) acceleration. To achieve higher efficiency, some recent CIM accelerators exploit NN sparsity based on CIM's fine-grained operation unit (OU) feature. However, new problems arise in a practical multi-macro accelerator: the mismatch between workload parallelism and CIM macro organization causes spatial under-utilization, and the differing computation times of the macros cause temporal under-utilization. To solve these under-utilization problems, we propose a Sparsity-balanced Practical CIM accelerator (SPCIM), comprising an optimized dataflow and a matching hardware architecture. For the CIM dataflow design, we first propose a reconfigurable cluster topology for CIM macro organization. We then regularize weight sparsity in the OU-height pattern and reorder the weight matrix by sparsity ratio. The cluster topology can be reshaped to match workload parallelism for higher spatial utilization, and each CIM cluster's workload is dynamically rebalanced for higher temporal utilization. The hardware architecture supports the proposed dataflow with a spatial input dispatcher and a temporal workload allocator. Experimental results show that, compared with a baseline sparse CIM accelerator that suffers from spatial and temporal under-utilization, SPCIM achieves a 2.94× speedup and a 2.86× energy saving. The proposed sparsity-balanced dataflow and architecture are generic and scalable and can be applied to other CIM accelerators: strengthening two state-of-the-art CIM accelerators with the SPCIM techniques improves their energy efficiency by 1.92× and 5.59×, respectively.
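
The abstract names two dataflow ideas: pruning weights in whole OU-height units so a CIM macro can skip them, and ordering and rebalancing each cluster's workload by sparsity ratio so the macros finish at similar times. The sketch below illustrates those two ideas on a plain NumPy weight matrix; the OU height, the L1-norm pruning criterion, the nonzero-OU cost model, and all function names are our own assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

OU_HEIGHT = 8  # assumed operation-unit height (rows per OU); hypothetical value


def ou_height_prune(weights: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Prune in an OU-height pattern (a sketch of the idea): rows are grouped
    into blocks of OU_HEIGHT, and within each column the blocks with the
    smallest L1 norm are zeroed, so sparsity falls in whole OU-height units
    that a macro can skip. Leftover rows (rows % OU_HEIGHT) are left untouched."""
    rows, cols = weights.shape
    n_blocks = rows // OU_HEIGHT
    pruned = weights.copy()
    for c in range(cols):
        col = pruned[: n_blocks * OU_HEIGHT, c].reshape(n_blocks, OU_HEIGHT)
        norms = np.abs(col).sum(axis=1)                 # L1 norm per OU block
        n_keep = max(1, round(keep_ratio * n_blocks))
        drop = np.argsort(norms)[: n_blocks - n_keep]   # weakest blocks
        col[drop, :] = 0.0
        pruned[: n_blocks * OU_HEIGHT, c] = col.reshape(-1)
    return pruned


def rebalance_columns(pruned: np.ndarray, n_clusters: int):
    """Greedy longest-processing-time scheduling: columns are ordered by
    their nonzero-OU count (a proxy for CIM computation time), and each is
    handed to the currently least-loaded cluster, keeping the clusters'
    computation times balanced."""
    rows, cols = pruned.shape
    n_blocks = rows // OU_HEIGHT
    blocks = pruned[: n_blocks * OU_HEIGHT].reshape(n_blocks, OU_HEIGHT, cols)
    cost = (np.abs(blocks).sum(axis=1) > 0).sum(axis=0)  # nonzero OUs per column
    loads = [0] * n_clusters
    assignment = [[] for _ in range(n_clusters)]
    for c in np.argsort(cost)[::-1]:          # heaviest columns first
        k = loads.index(min(loads))           # least-loaded cluster so far
        assignment[k].append(int(c))
        loads[k] += int(cost[c])
    return assignment, loads


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 16))
    sparse_W = ou_height_prune(W, keep_ratio=0.5)
    assignment, loads = rebalance_columns(sparse_W, n_clusters=4)
    print("per-cluster loads (nonzero OUs):", loads)
```

Note that with a uniform keep_ratio every column ends up equally heavy, so the demo's loads come out equal by construction; on a network pruned to irregular per-column sparsity, the LPT-style assignment is what narrows the gap between the fastest and slowest cluster, which is the temporal under-utilization problem the abstract targets.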
