TT@CIM: A Tensor-Train In-Memory-Computing Processor Using Bit-Level-Sparsity Optimization and Variable Precision Quantization

Computing-in-memory (CIM) is an attractive approach for energy-efficient deep neural network (DNN) processing, especially on low-power edge devices. However, today's typical DNNs usually exceed the capacity of CIM static random access memory (SRAM). The resulting off-chip communication negates the benefits of the CIM technique, so CIM processors still face a memory bottleneck. To eliminate this bottleneck, we propose a CIM processor, called TT@CIM, which applies the tensor-train decomposition (TTD) method to compress the entire DNN so that it fits within CIM-SRAM. However, the storage reduction offered by TTD comes at the cost of multiple serial small-size matrix multiplications, resulting in a large number of inefficient multiply-and-accumulate (MAC) and quantization operations (QuantOps). To achieve high energy efficiency, three optimization techniques are proposed in TT@CIM. First, a TTD-CIM-matched dataflow is proposed to maximize CIM utilization and minimize additional MAC operations. Second, a bit-level-sparsity-optimized CIM macro with a high-bit-level-sparsity encoding scheme is designed to reduce the power consumed by each MAC operation. Third, a variable-precision quantization method and a lookup-table-based quantization unit are presented to improve the performance and energy efficiency of QuantOps. Fabricated in 28-nm CMOS and tested on 4/8-bit decomposed DNNs, TT@CIM achieves a peak energy efficiency of 5.99-to-691.13 TOPS/W depending on the operating voltage.
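
As background for the TTD compression described above, the following is a minimal NumPy sketch of the standard TT-matrix format applied to a small fully connected layer. The mode sizes, TT-ranks, and random cores are illustrative assumptions and do not reflect TT@CIM's actual layer configurations, precision, or dataflow.

```python
# Minimal sketch of a tensor-train (TT) compressed fully connected layer.
# Assumption: a 64x64 weight matrix is stored as three small TT cores
# (illustrative shapes and ranks, not the paper's configuration).
import numpy as np

in_modes  = (4, 4, 4)      # n1, n2, n3: 4*4*4 = 64 input features
out_modes = (4, 4, 4)      # m1, m2, m3: 4*4*4 = 64 output features
ranks     = (1, 3, 3, 1)   # TT-ranks r0..r3 (boundary ranks are 1)

rng = np.random.default_rng(0)
# Core k has shape (r_{k-1}, m_k, n_k, r_k).
cores = [rng.standard_normal((ranks[k], out_modes[k], in_modes[k], ranks[k + 1]))
         for k in range(3)]

x = rng.standard_normal(int(np.prod(in_modes)))

# TT-based matrix-vector product: the dense weight matrix is never materialized.
# The reshaped input is contracted with the three small cores (in practice this
# is done core by core, i.e., the chain of serial small matrix multiplications
# the abstract refers to).
X = x.reshape(in_modes)
Y = np.einsum('aijb,bklc,cmnd,jln->ikm', cores[0], cores[1], cores[2], X)
y_tt = Y.reshape(-1)

# Reference check: reconstruct the full 64x64 weight matrix and multiply directly.
W = np.einsum('aijb,bklc,cmnd->ikmjln', cores[0], cores[1], cores[2])
W = W.reshape(int(np.prod(out_modes)), int(np.prod(in_modes)))
assert np.allclose(W @ x, y_tt)

# Storage comparison: 4096 dense weights vs. 240 TT-core weights.
print(W.size, sum(c.size for c in cores))
```

For this toy layer the TT cores hold 240 parameters versus 4096 for the dense matrix, which illustrates why TTD can shrink an entire DNN to fit in CIM-SRAM while trading one large matrix multiplication for a chain of small ones, the source of the extra MAC operations and QuantOps that TT@CIM's three techniques target.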
