Performance Analysis of Cambricon MLU100

In recent years, domain-specific hardware has brought significant performance improvements to deep learning (DL). Many widely used optimization techniques, such as data parallelism, model parallelism, data pipelining, weight pruning, and quantization, have been proposed to accelerate the inference phase of DL workloads. However, these techniques have not yet been systematically compared to show their performance differences on dedicated accelerators. This paper evaluates these widely used optimization techniques on a commercial accelerator, the Cambricon MLU100. Given the accuracy requirements inherent to DL workloads, our metric not only measures inference throughput but also imposes an accuracy constraint. Based on our analysis methodology and performance numbers, we draw several key observations and implications that are valuable for future DL hardware/software co-design. Furthermore, we explore the upper bound of MLU100 inference performance using the standard ResNet-50 model and the CIFAR-10 dataset.
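
The metric, as described, reports inference throughput only for configurations that also satisfy a fixed accuracy target. Below is a minimal sketch of such a constrained metric; the harness interface (`model.predict`, `ACCURACY_THRESHOLD`, the batch iterator) is an illustrative assumption, not the paper's actual evaluation code.

```python
# Sketch of a throughput-under-accuracy-constraint metric.
# All names (evaluate, ACCURACY_THRESHOLD, model.predict) are illustrative
# assumptions for a generic inference harness, not the paper's code.
import time

ACCURACY_THRESHOLD = 0.74  # hypothetical fixed top-1 target for the model under test

def evaluate(model, batches):
    """Return (throughput in samples/s, top-1 accuracy) over a labeled dataset."""
    correct = total = 0
    start = time.perf_counter()
    for images, labels in batches:
        preds = model.predict(images)  # accelerator inference call (assumed API)
        correct += sum(int(p == l) for p, l in zip(preds, labels))
        total += len(labels)
    elapsed = time.perf_counter() - start
    return total / elapsed, correct / total

def constrained_throughput(model, batches):
    """Report throughput only if the accuracy constraint is met."""
    throughput, accuracy = evaluate(model, batches)
    if accuracy < ACCURACY_THRESHOLD:
        return 0.0, accuracy  # configuration rejected: accuracy constraint violated
    return throughput, accuracy
```

Under this kind of metric, an aggressive optimization (e.g., heavy pruning or low-bit quantization) that boosts raw throughput but drops accuracy below the target scores zero, which keeps throughput comparisons across optimization techniques meaningful.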
