IVQ: In-Memory Acceleration of DNN Inference Exploiting Varied Quantization

Weight quantization is widely adopted to cope with the ever-growing complexity of deep neural network (DNN) models. Diversified quantization schemes produce weights of different bit widths and formats, which in turn require different hardware implementations. Such variety prevents a general NPU from leveraging different quantization schemes to gain performance and energy efficiency. More importantly, an emerging trend of quantization diversity applies multiple quantization schemes to different fine-grained structures (e.g., a layer or a weight channel) of a single DNN. A general architecture is therefore desired to exploit varied quantization schemes. The crossbar-based processing-in-memory (PIM) architecture, a promising DNN accelerator, is well known for its highly efficient matrix-vector multiplication. However, PIM suffers from an inflexible intracrossbar data path: the weights are stationary on the crossbar and bound to the "add" operation along the bitline. Consequently, many nonuniform quantization methods must roll back the quantization before mapping the weights onto the crossbar. Counterintuitively, this article discovers a unique opportunity for the PIM architecture to exploit varied quantization schemes. We first transform the quantization-diversity problem into a consistency problem by aligning bits of the same magnitude along the same bitline of the crossbar. However, such a naive weight mapping leaves many square hollows of idle PIM cells. We then propose a novel spatial mapping that exempts these "hollow" crossbars from the intercrossbar data path. To further squeeze the weights onto fewer crossbars, we decouple the intracrossbar data path from the hardware bitline via a novel temporal scheduling, so that bits of different magnitudes can be placed in cells along the same bitline. Finally, the proposed IVQ includes a temporal pipeline that avoids the introduced stall cycles, and a data flow with dedicated control mechanisms for the new intra- and intercrossbar data paths. Putting it all together, IVQ achieves $19.7\times$, $10.7\times$, $4.7\times\sim 63.4\times$, and $91.7\times$ speedup, and $17.7\times$, $5.1\times$, $5.7\times\sim 68.1\times$, and $541\times$ energy savings over two PIM accelerators (ISAAC and CASCADE), two customized quantization accelerators (based on ASIC and FPGA), and an NVIDIA RTX 2080 GPU, respectively.
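To make the bit-alignment idea above concrete, the following Python sketch maps integer-quantized weights onto crossbar columns indexed by bit magnitude. This is a minimal illustration under stated assumptions (an 8-column crossbar, non-negative integer weights, all-ones input activations); the helper names and shapes are hypothetical and do not reflect IVQ's actual implementation.

```python
# A minimal sketch, assuming an 8-column crossbar and integer-quantized weights;
# names and shapes here are illustrative, not IVQ's actual data structures.
import numpy as np

XBAR_COLS = 8  # assumed crossbar width: one column per bit magnitude 2^0 .. 2^7

def bit_planes(weight_int: int, n_bits: int = XBAR_COLS) -> np.ndarray:
    """Decompose a non-negative integer weight into its bit-plane vector (LSB first)."""
    return np.array([(weight_int >> b) & 1 for b in range(n_bits)], dtype=np.uint8)

def map_row(weights_int):
    """Place a group of weights on a crossbar row block, with column index = bit magnitude.

    A weight quantized with fewer effective bits (e.g., a power-of-two weight with a
    single '1' bit) simply leaves idle "hollow" cells in the remaining columns, while
    every column still accumulates bits of one identical magnitude.
    """
    return np.stack([bit_planes(w) for w in weights_int])  # shape: (num_weights, XBAR_COLS)

# Example: an 8-bit uniform-quantized weight (90) next to a power-of-two weight (4).
cells = map_row([0b01011010, 0b00000100])
column_sums = cells.sum(axis=0)                      # per-bitline popcount ~ analog current
result = int((column_sums * (1 << np.arange(XBAR_COLS))).sum())  # shift-and-add recombination
print(cells)
print(column_sums, result)                           # result == 90 + 4 for all-ones inputs
```

Because every column carries bits of a single magnitude, weights produced by different quantization schemes can share one analog accumulation; the idle cells left by low-bit schemes are exactly the "hollows" that IVQ's spatial mapping and temporal scheduling then reclaim.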
