High-Performance Deep Learning via a Single Building Block

Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly specialized kernels for each workload and architecture, leading to numerous, complex codebases that strive for performance yet are hard to maintain and do not generalize. In this work, we introduce the batch-reduce GEMM kernel and show how the most popular DL algorithms can be formulated with this kernel as the basic building block. Consequently, DL library development reduces to mere (potentially automatic) tuning of the loops around this single optimized kernel. By exploiting our new kernel, we implement Recurrent Neural Network, Convolutional Neural Network, and Multilayer Perceptron training and inference primitives in just 3K lines of high-level code. Our primitives outperform vendor-optimized libraries on multi-node CPU clusters, and we also provide proof-of-concept CNN kernels targeting GPUs. Finally, we demonstrate that using the batch-reduce GEMM kernel within a tensor compiler yields high-performance CNN primitives, further underscoring the viability of our approach.
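
To make the core idea concrete, the sketch below illustrates the semantics of a batch-reduce GEMM (a single output block accumulates the products of a whole batch of sub-matrix pairs, C = beta*C + sum_i A_i B_i) and how a DL primitive such as a fully-connected layer can be expressed as plain loops around that one kernel. This is a minimal NumPy sketch for exposition only; the function and parameter names (`batch_reduce_gemm`, `fully_connected_forward`, `bn`, `bk`, `bc`) are illustrative assumptions, not the paper's or any library's API, and no performance-oriented details (JIT code generation, vectorization, parallelization) are shown.

```python
import numpy as np

def batch_reduce_gemm(a_blocks, b_blocks, c_block, beta=1.0):
    """Batch-reduce GEMM semantics: C = beta*C + sum_i A_i @ B_i.

    A single small output block accumulates the products of a whole
    batch of (A_i, B_i) sub-matrix pairs.
    """
    c_block *= beta
    for a_i, b_i in zip(a_blocks, b_blocks):
        c_block += a_i @ b_i
    return c_block

def fully_connected_forward(x, w, bn, bk, bc):
    """Blocked fully-connected (MLP) forward pass written as loops
    around the batch-reduce GEMM kernel.

    x: activations of shape (N, C); w: weights of shape (C, K);
    bn, bk, bc: block sizes along the N, K and C dimensions.
    """
    n, c = x.shape
    _, k = w.shape
    y = np.zeros((n, k), dtype=x.dtype)
    for nb in range(0, n, bn):          # loop over mini-batch blocks
        for kb in range(0, k, bk):      # loop over output-feature blocks
            # The reduction over input-feature blocks is folded into
            # a single batch-reduce GEMM call on this output block.
            a_blocks = [x[nb:nb + bn, cb:cb + bc] for cb in range(0, c, bc)]
            b_blocks = [w[cb:cb + bc, kb:kb + bk] for cb in range(0, c, bc)]
            batch_reduce_gemm(a_blocks, b_blocks,
                              y[nb:nb + bn, kb:kb + bk], beta=0.0)
    return y
```

Other primitives follow the same pattern: a convolution, for instance, would gather its A/B sub-blocks over input-channel blocks and filter offsets, so only the loop structure around the kernel changes while the inner kernel stays the same.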
