SoBS-X: Squeeze-Out Bit Sparsity for ReRAM-Crossbar-Based Neural Network Accelerator

Resistive random-access-memory (ReRAM) crossbars are a promising technology for deep neural network (DNN) accelerators, thanks to their in-memory and in-situ analog computing ability for vector–matrix multiply-and-accumulate operations (VMMs). However, it is challenging for the crossbar architecture to exploit the sparsity in DNNs: the tightly coupled crossbar structure makes exploiting fine-grained sparsity inevitably complex and costly. As a countermeasure, we develop a novel ReRAM-based DNN accelerator, named the sparse-multiplication-engine (SME), based on a hardware/software co-design framework. First, we orchestrate the bit-sparse pattern to increase the density of bit-level sparsity on top of existing quantization methods. Such quantized weights can be generated with alternating direction method of multipliers (ADMM) optimization during DNN fine-tuning, which exactly enforces the desired bit patterns in the weights. Second, we propose a novel weight-mapping mechanism that slices the bits of each weight across crossbars and splices the activation results in the peripheral circuits. This mechanism decouples the tightly coupled crossbar structure and accumulates the sparsity within the crossbars. Finally, a squeeze-out scheme empties the crossbars that, after the previous two steps, hold only a few nonzero bits. We design the SME architecture and discuss its use with other quantization methods and different ReRAM cell technologies. We further propose a workload-grouping algorithm and a pipeline that balance the workload among the crossbar rows that concurrently execute multiply–accumulate operations, optimizing the system latency. Putting it all together, with the optimized model and compared with prior state-of-the-art designs, SME shrinks crossbar usage by up to $8.7\times$ and $2.1\times$ for ResNet-50 and MobileNet-v2, respectively, and achieves an average $3.1\times$ speedup with little or no accuracy loss on ImageNet.
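To make the bit-slicing and squeeze-out idea concrete, the following is a minimal, illustrative sketch (not the paper's actual SME/SoBS-X mapping algorithm): quantized weights are sliced into per-bit planes, each plane is partitioned into crossbar-sized tiles, and tiles that contain no nonzero bits can be squeezed out instead of being mapped. The 8-bit unsigned weights, the 128x128 crossbar tile size, and the `slice_and_squeeze` helper are assumptions made only for this example.

```python
import numpy as np


def slice_and_squeeze(weights, n_bits=8, xbar_rows=128, xbar_cols=128):
    """Conceptual sketch: bit-slice integer weights into per-bit planes,
    tile each plane to crossbar size, and count how many tiles must be
    mapped versus how many are all-zero and can be squeezed out.

    weights : 2-D array of unsigned n_bits-bit quantized weights
              (rows = inputs, cols = outputs), an illustrative assumption.
    """
    assert weights.ndim == 2
    mapped, squeezed = 0, 0

    for b in range(n_bits):
        # Bit plane b: a 0/1 matrix holding bit b of every weight.
        plane = ((weights >> b) & 1).astype(np.uint8)

        # Partition the plane into crossbar-sized tiles.
        for r in range(0, plane.shape[0], xbar_rows):
            for c in range(0, plane.shape[1], xbar_cols):
                tile = plane[r:r + xbar_rows, c:c + xbar_cols]
                if tile.any():
                    mapped += 1      # at least one '1' bit: tile must be mapped
                else:
                    squeezed += 1    # all-zero tile: squeezed out, no crossbar used
    return mapped, squeezed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 512x512 layer whose high-order bits are always zero,
    # mimicking a bit-sparse quantization pattern (only the 4 low bits are set).
    w = rng.integers(0, 16, size=(512, 512))
    mapped, squeezed = slice_and_squeeze(w)
    print(f"mapped crossbar tiles: {mapped}, squeezed-out tiles: {squeezed}")
```

In this toy setting, the four all-zero high-order bit planes are squeezed out entirely, halving the number of crossbar tiles that need to be programmed; the paper's ADMM-based fine-tuning aims to enforce such bit patterns so that far more tiles become squeezable in practice.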
