LSMQ: A Layer-Wise Sensitivity-Based Mixed-Precision Quantization Method for Bit-Flexible CNN Accelerator

Model quantization is a prevailing way to accelerate convolutional neural networks (CNNs). Mixed-precision quantization tends to compress a model better and further improves computation efficiency. However, identifying the optimal bit width for each layer is challenging. In this paper, we propose a layer-wise sensitivity-based mixed-precision quantization method (LSMQ). We first calculate the sensitivity of each layer; the weights of each layer are then automatically quantized to a layer-specific precision determined by the sensitivity ranking and an effective search strategy, without retraining. Moreover, we present a bit-flexible CNN accelerator that efficiently supports data operations at the varying bit widths produced by mixed-precision quantization. Experiments show that the top-1 accuracy of VGG16 quantized with LSMQ is 7.31% higher than that of previous work, while the model size is 3.4% smaller.
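To make the described pipeline concrete, below is a minimal Python/PyTorch sketch of the two steps the abstract outlines: probing layer-wise sensitivity and assigning bit widths from the ranking. The sensitivity metric used here (loss increase when a single layer is quantized in isolation), the uniform symmetric quantizer, and all names and parameters (quantize_weight, layer_sensitivity, assign_bits, probe_bits, frac_high) are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn as nn

    def quantize_weight(w: torch.Tensor, bits: int) -> torch.Tensor:
        # Uniform symmetric quantization of a weight tensor to the given bit width.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max().clamp_min(1e-8) / qmax
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @torch.no_grad()
    def layer_sensitivity(model: nn.Module, loss_fn, batch, probe_bits: int = 4):
        # Sensitivity of a layer = loss increase on a calibration batch when
        # that layer alone is quantized to a low probe precision while all
        # other layers stay in full precision.
        x, y = batch
        base_loss = loss_fn(model(x), y).item()
        sensitivities = {}
        for name, module in model.named_modules():
            if not isinstance(module, (nn.Conv2d, nn.Linear)):
                continue
            original = module.weight.data.clone()
            module.weight.data = quantize_weight(original, probe_bits)
            sensitivities[name] = loss_fn(model(x), y).item() - base_loss
            module.weight.data = original  # restore full precision
        return sensitivities

    def assign_bits(sensitivities, high: int = 8, low: int = 2,
                    frac_high: float = 0.3):
        # Greedy assignment: the most sensitive fraction of layers gets the
        # higher bit width; the remaining layers get the lower one.
        ranked = sorted(sensitivities, key=sensitivities.get, reverse=True)
        cut = max(1, int(len(ranked) * frac_high))
        return {name: (high if i < cut else low)
                for i, name in enumerate(ranked)}

Given a calibration batch, calling layer_sensitivity(model, nn.CrossEntropyLoss(), (images, labels)) followed by assign_bits(...) yields a per-layer bit-width map without any retraining; a fuller implementation would sweep frac_high or candidate bit widths under a model-size budget, which stands in here for the paper's search strategy.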