TT@CIM: A Tensor-Train In-Memory-Computing Processor Using Bit-Level-Sparsity Optimization and Variable Precision Quantization

Computing-in-memory (CIM) is an attractive approach for energy-efficient deep neural network (DNN) processing, especially on low-power edge devices. However, today's typical DNNs usually exceed the capacity of CIM static random access memory (SRAM). The resulting off-chip communication negates the benefits of the CIM technique, so CIM processors still face a memory bottleneck. To eliminate this bottleneck, we propose a CIM processor, called TT@CIM, which applies the tensor-train decomposition (TTD) method to compress the entire DNN so that it fits within CIM-SRAM. However, the storage reduction offered by TTD comes at the cost of multiple serial small-size matrix multiplications, resulting in a large number of inefficient multiply-and-accumulate (MAC) and quantization operations (QuantOps). To achieve high energy efficiency, three optimization techniques are proposed in TT@CIM. First, a TTD-CIM-matched dataflow is proposed to maximize CIM utilization and minimize additional MAC operations. Second, a bit-level-sparsity-optimized CIM macro with a high-bit-level-sparsity encoding scheme is designed to reduce the power consumed by each MAC operation. Third, a variable-precision quantization method and a lookup-table-based quantization unit are presented to improve the performance and energy efficiency of QuantOps. Fabricated in 28-nm CMOS and tested on 4/8-bit decomposed DNNs, TT@CIM achieves a peak energy efficiency of 5.99-to-691.13 TOPS/W depending on the operating voltage.
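
As background for the TTD compression described above, the following is a minimal NumPy sketch of the standard TT-matrix format applied to a small fully connected layer. The mode sizes, TT-ranks, and random cores are illustrative assumptions and do not reflect TT@CIM's actual layer configurations, precision, or dataflow.

```python
# Minimal sketch of a tensor-train (TT) compressed fully connected layer.
# Assumption: a 64x64 weight matrix is stored as three small TT cores
# (illustrative shapes and ranks, not the paper's configuration).
import numpy as np

in_modes  = (4, 4, 4)      # n1, n2, n3: 4*4*4 = 64 input features
out_modes = (4, 4, 4)      # m1, m2, m3: 4*4*4 = 64 output features
ranks     = (1, 3, 3, 1)   # TT-ranks r0..r3 (boundary ranks are 1)

rng = np.random.default_rng(0)
# Core k has shape (r_{k-1}, m_k, n_k, r_k).
cores = [rng.standard_normal((ranks[k], out_modes[k], in_modes[k], ranks[k + 1]))
         for k in range(3)]

x = rng.standard_normal(int(np.prod(in_modes)))

# TT-based matrix-vector product: the dense weight matrix is never materialized.
# The reshaped input is contracted with the three small cores (in practice this
# is done core by core, i.e., the chain of serial small matrix multiplications
# the abstract refers to).
X = x.reshape(in_modes)
Y = np.einsum('aijb,bklc,cmnd,jln->ikm', cores[0], cores[1], cores[2], X)
y_tt = Y.reshape(-1)

# Reference check: reconstruct the full 64x64 weight matrix and multiply directly.
W = np.einsum('aijb,bklc,cmnd->ikmjln', cores[0], cores[1], cores[2])
W = W.reshape(int(np.prod(out_modes)), int(np.prod(in_modes)))
assert np.allclose(W @ x, y_tt)

# Storage comparison: 4096 dense weights vs. 240 TT-core weights.
print(W.size, sum(c.size for c in cores))
```

For this toy layer the TT cores hold 240 parameters versus 4096 for the dense matrix, which illustrates why TTD can shrink an entire DNN to fit in CIM-SRAM while trading one large matrix multiplication for a chain of small ones, the source of the extra MAC operations and QuantOps that TT@CIM's three techniques target.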
