Tensor Processing Primitives: A Programming Abstraction for Efficiency and Portability in Deep Learning Workloads

During the past decade, novel Deep Learning (DL) algorithms, workloads, and hardware have been developed to tackle a wide range of problems. Despite the advances in the workload and hardware ecosystems, the programming methodology of DL systems has stagnated. DL workloads leverage either highly optimized yet platform-specific and inflexible kernels from DL libraries, or, in the case of novel operators, reference implementations built via DL framework primitives with underwhelming performance. This work introduces the Tensor Processing Primitives (TPP), a programming abstraction striving for efficient, portable implementation of DL workloads with high productivity. TPPs define a compact yet versatile set of 2D-tensor operators (or a virtual Tensor ISA), which can subsequently be used as building blocks to construct complex operators on high-dimensional tensors. The TPP specification is platform-agnostic, so code expressed via TPPs is portable, whereas the TPP implementations are highly optimized and platform-specific. We demonstrate the efficacy of our approach using standalone kernels and end-to-end DL workloads expressed entirely via TPPs that outperform state-of-the-art implementations on multiple platforms. A sketch of how such composition might look is given below.
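To make the composition idea concrete, the following C sketch (not taken from the paper) builds a blocked fully connected layer, GEMM plus bias plus ReLU, out of small 2D-tile operators. The tpp_* function names, the naive scalar kernel bodies, and the blocked tensor layout are all illustrative assumptions: in an actual TPP backend each 2D operator would dispatch a JIT-generated, platform-specific kernel, while the surrounding loops and layout remain portable.

/* Minimal sketch: composing a blocked fully connected layer forward pass
 * (GEMM + bias + ReLU) from hypothetical 2D-tensor primitives.  The tpp_*
 * functions below are naive scalar stand-ins, not the paper's actual API. */
#include <stddef.h>

/* Hypothetical TPP: C[m][n] += A[m][k] * B[k][n] on row-major tiles. */
static void tpp_gemm(const float *A, const float *B, float *C, int m, int n, int k) {
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      for (int l = 0; l < k; ++l)
        C[i * n + j] += A[i * k + l] * B[l * n + j];
}

/* Hypothetical TPP: broadcast-add a length-m bias over the n columns of a tile. */
static void tpp_bias_add(float *C, const float *bias, int m, int n) {
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      C[i * n + j] += bias[i];
}

/* Hypothetical TPP: elementwise ReLU on an m x n tile. */
static void tpp_relu(float *C, int m, int n) {
  for (int i = 0; i < m * n; ++i)
    C[i] = C[i] > 0.0f ? C[i] : 0.0f;
}

/* Complex operator built from the 2D TPPs above: a fully connected layer whose
 * tensors are blocked into small tiles (layout is illustrative, not the paper's):
 *   wgt [Mb][Kb][bm*bk], act [Kb][Nb][bk*bn], bias [Mb][bm], out [Mb][Nb][bm*bn] */
void fc_forward(const float *wgt, const float *act, const float *bias, float *out,
                int Mb, int Nb, int Kb, int bm, int bn, int bk) {
  for (int mb = 0; mb < Mb; ++mb) {
    for (int nb = 0; nb < Nb; ++nb) {
      float *tile = &out[((size_t)mb * Nb + nb) * bm * bn];
      for (int i = 0; i < bm * bn; ++i) tile[i] = 0.0f;
      for (int kb = 0; kb < Kb; ++kb)   /* reduce over the input-channel blocks */
        tpp_gemm(&wgt[((size_t)mb * Kb + kb) * bm * bk],
                 &act[((size_t)kb * Nb + nb) * bk * bn],
                 tile, bm, bn, bk);
      tpp_bias_add(tile, &bias[(size_t)mb * bm], bm, bn);
      tpp_relu(tile, bm, bn);
    }
  }
}

Note that only the bodies of the tpp_* kernels are platform-specific; the outer loops and the fc_forward operator itself are written once against the 2D-tile abstraction, which is the portability argument the abstract makes.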
