SparseDNN: Fast Sparse Deep Learning Inference on CPUs

The last few years have seen major advances in algorithms and systems for efficient deep learning inference. Pruning and quantization algorithms can now consistently compress neural networks by an order of magnitude, and a multitude of inference frameworks have been designed to extract maximum performance from the target hardware for such compressed networks. However, while production frameworks such as OpenVINO and MNN offer mature support for quantized neural networks, support for pruned sparse neural networks is still lacking. To tackle this challenge, we present SparseDNN, a sparse deep learning inference engine targeting CPUs. We present both kernel-level optimizations, built around a sparse code generator that accelerates sparse operators, and novel network-level optimizations catering to sparse networks. We show that our sparse code generator achieves significant speedups over state-of-the-art sparse and dense libraries. On end-to-end benchmarks such as Huggingface pruneBERT, SparseDNN achieves up to 5x throughput improvement over dense inference with state-of-the-art OpenVINO.
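The abstract does not reproduce any of SparseDNN's generated kernels. As a rough illustration of the kind of sparse operator such an engine accelerates, the following Python sketch contrasts a dense linear layer with the equivalent computation over a pruned weight matrix stored in compressed (CSR) form. SciPy here is only a stand-in for a specialized generated kernel, and the matrix size, 90% unstructured sparsity level, and random pruning mask are illustrative assumptions, not details from the paper.

    # Illustrative sketch only: SciPy's generic CSR kernel stands in for a
    # specialized sparse kernel of the kind SparseDNN generates.
    import numpy as np
    from scipy.sparse import csr_matrix

    rng = np.random.default_rng(0)

    # Hypothetical pruned layer: 1024x1024 weights at ~90% unstructured
    # sparsity, a regime comparable to pruneBERT-style models.
    W = rng.standard_normal((1024, 1024)).astype(np.float32)
    W[rng.random(W.shape) < 0.9] = 0.0   # random mask, for illustration only
    W_csr = csr_matrix(W)                # store only the ~10% nonzero weights

    # A batch of input activations.
    x = rng.standard_normal((1024, 128)).astype(np.float32)

    y_dense = W @ x        # dense GEMM: reads all ~1M weights
    y_sparse = W_csr @ x   # sparse SpMM: reads only the stored nonzeros

    # Both paths compute the same layer output (up to float32 rounding).
    assert np.allclose(y_dense, y_sparse, rtol=1e-3, atol=1e-3)

The point of the sketch is that at high unstructured sparsity a sparse kernel only has to read and multiply the stored nonzeros; turning that arithmetic headroom into wall-clock speedup on CPUs is what the paper's kernel-level and network-level optimizations address.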
