Accelerate Binarized Neural Networks with Processing-in-Memory Enabled by RISC-V Custom Instructions

As the speed of processing units grows rapidly, system performance is increasingly bottlenecked by the speed of memory, a situation known as the "Memory Wall". Among the emerging technologies that attempt to break down the Memory Wall is Processing-in-Memory (PIM), in which data are processed inside the memory itself, eliminating the time spent transferring them between the CPU and memory. Moreover, with only minor modifications to memory devices, the memory can perform primitive bit-wise operations on the memory side. The Binarized Neural Network (BNN), which replaces the multiplications and additions of convolution with bit-wise AND and population-count operations, is therefore well suited to exploiting PIM for performance. This work architects PIM AND, NOT, and population-count operations and exposes them through RISC-V custom instruction encodings. In addition, we leverage TVM's support for BNNs to generate application sources, and we propose a new design for BNN convolution that takes a better memory layout into account. With our design, end-to-end BNN model inference achieves speedups ranging from 3.7x to 57.3x over a CPU-based system.
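To illustrate the core idea the abstract describes, the following is a minimal sketch (not the paper's implementation) of how a binarized dot product reduces a convolution's multiply-accumulate to a bit-wise AND followed by a population count, assuming a {0, 1} encoding with one element packed per bit. The function names are illustrative only.

```python
# Minimal sketch: binarized dot product via AND + popcount.
# Assumes activations/weights are binarized to {0, 1} and packed
# one element per bit into plain integers.

def popcount(x: int) -> int:
    """Count the number of set bits (population count)."""
    return bin(x).count("1")

def binary_dot(a_bits: int, b_bits: int) -> int:
    """Multiply-accumulate over binarized vectors:
    elementwise product of {0,1} values is a bit-wise AND,
    and the accumulation is a popcount of the result."""
    return popcount(a_bits & b_bits)

# Example: a = [1,0,1,1] and b = [1,1,0,1], packed MSB-first.
a = 0b1011
b = 0b1101
print(binary_dot(a, b))  # two positions where both bits are 1
```

Because both the AND and the popcount operate on whole machine words of packed bits at once, they map naturally onto the bulk bit-wise operations a PIM-enabled memory can perform in place.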
