XDN: Towards Efficient Inference of Residual Neural Networks on Cambricon Chips

In this paper, we present XDN, an optimization and inference engine for accelerating residual neural networks on Cambricon chips. We leverage a channel pruning method to compress the weights of ResNet-50. By exploring optimization opportunities in the computational graph, we propose a layer fusion strategy that dramatically decreases the number of scalar computation layers, such as Batch Normalization and Scale. Furthermore, we design an efficient implementation of XDN, including data preprocessing, hyper-parameter auto-tuning, and other components. Experimental results show that the ResNet-50 model achieves a significant speedup without accuracy loss when using our XDN engine.
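For context, folding Batch Normalization and Scale layers into the preceding convolution is the standard arithmetic behind this kind of layer fusion. The sketch below is a minimal NumPy illustration of that folding; the function name, argument layout, and epsilon default are assumptions made for illustration and do not reflect XDN's actual implementation.

```python
import numpy as np

def fold_bn_scale_into_conv(W, b, mean, var, gamma, beta, eps=1e-5):
    """Fold a BatchNorm + Scale pair into the preceding convolution.

    W:     conv weights, shape (out_channels, in_channels, kH, kW)
    b:     conv bias, shape (out_channels,) (zeros if the conv has no bias)
    mean,  var:   per-channel BN statistics, shape (out_channels,)
    gamma, beta:  per-channel Scale parameters, shape (out_channels,)
    Returns (W_folded, b_folded) that compute the same output as conv -> BN -> Scale.
    """
    scale = gamma / np.sqrt(var + eps)          # per-output-channel multiplier
    W_folded = W * scale[:, None, None, None]   # rescale each output filter
    b_folded = (b - mean) * scale + beta        # shift the bias accordingly
    return W_folded, b_folded
```

After this folding, the BN and Scale layers can be removed from the graph, leaving a single convolution per fused group at inference time.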
