SparseDNN: Fast Sparse Deep Learning Inference on CPUs

The last few years have seen major advances in algorithms and systems for efficient deep learning inference. Pruning and quantization algorithms can now consistently compress neural networks by an order of magnitude, and a multitude of inference frameworks have been designed to extract maximum performance from the target hardware for such compressed networks. However, while production frameworks such as OpenVINO and MNN offer mature support for quantized neural networks, support for pruned sparse neural networks is still lacking. To tackle this challenge, we present SparseDNN, a sparse deep learning inference engine targeting CPUs. We present both kernel-level optimizations, built around a sparse code generator that accelerates sparse operators, and novel network-level optimizations catering to sparse networks. We show that our sparse code generator achieves significant speedups over state-of-the-art sparse and dense libraries. On end-to-end benchmarks such as Huggingface pruneBERT, SparseDNN achieves up to 5x throughput improvement over dense inference with state-of-the-art OpenVINO.
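The abstract does not reproduce any of SparseDNN's generated kernels. As a rough illustration of the kind of sparse operator such an engine accelerates, the following Python sketch contrasts a dense linear layer with the equivalent computation over a pruned weight matrix stored in compressed (CSR) form. SciPy here is only a stand-in for a specialized generated kernel, and the matrix size, 90% unstructured sparsity level, and random pruning mask are illustrative assumptions, not details from the paper.

    # Illustrative sketch only: SciPy's generic CSR kernel stands in for a
    # specialized sparse kernel of the kind SparseDNN generates.
    import numpy as np
    from scipy.sparse import csr_matrix

    rng = np.random.default_rng(0)

    # Hypothetical pruned layer: 1024x1024 weights at ~90% unstructured
    # sparsity, a regime comparable to pruneBERT-style models.
    W = rng.standard_normal((1024, 1024)).astype(np.float32)
    W[rng.random(W.shape) < 0.9] = 0.0   # random mask, for illustration only
    W_csr = csr_matrix(W)                # store only the ~10% nonzero weights

    # A batch of input activations.
    x = rng.standard_normal((1024, 128)).astype(np.float32)

    y_dense = W @ x        # dense GEMM: reads all ~1M weights
    y_sparse = W_csr @ x   # sparse SpMM: reads only the stored nonzeros

    # Both paths compute the same layer output (up to float32 rounding).
    assert np.allclose(y_dense, y_sparse, rtol=1e-3, atol=1e-3)

The point of the sketch is that at high unstructured sparsity a sparse kernel only has to read and multiply the stored nonzeros; turning that arithmetic headroom into wall-clock speedup on CPUs is what the paper's kernel-level and network-level optimizations address.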
