FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads

We show in this work that memory-intensive computations can cause severe performance problems due to off-chip memory access and CPU-GPU context switch overheads in a wide range of deep learning models. Existing just-in-time kernel fusion and code generation techniques address this problem only partially, owing to limitations such as kernel schedule incompatibilities and coarse fusion plan exploration strategies. We propose FusionStitching, a deep learning compiler that automatically fuses memory-intensive operators, with varied data dependencies and non-homogeneous parallelism, into large GPU kernels to reduce global memory access and operator scheduling overhead. FusionStitching explores large fusion spaces to decide optimal fusion plans, taking memory access costs, kernel launch counts, and resource usage constraints into account. We thoroughly study the schemes for stitching operators together in complex scenarios, and FusionStitching selects the best stitching scheme just-in-time using an efficient domain-specific cost model. Experimental results show that FusionStitching achieves up to a 2.78x speedup over TensorFlow and the current state of the art. Beyond these experiments, we integrated our approach into a compiler product and deployed it on a production cluster for AI workloads with thousands of GPUs. The system has been in operation for more than four months and saves on average 7,000 GPU hours per month across approximately 30,000 tasks.
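To make the kernel-stitching idea concrete, the following is a minimal CUDA sketch, not FusionStitching's generated code: it contrasts two memory-bound elementwise operators (a hypothetical bias-add followed by ReLU) run as separate kernels against a single stitched kernel. The unfused version writes the intermediate tensor to global memory and re-reads it, while the fused version keeps it in a register, roughly halving global memory traffic and saving one kernel launch.

```cuda
// Illustrative sketch only; operator choice and names are assumptions for this example.
#include <cuda_runtime.h>

// Unfused: two kernels, intermediate `tmp` round-trips through global memory.
__global__ void bias_add(const float* x, const float* b, float* tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = x[i] + b[i];          // write intermediate to global memory
}

__global__ void relu(const float* tmp, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(tmp[i], 0.0f);    // re-read intermediate from global memory
}

// Fused (stitched): one kernel, intermediate stays in a register.
__global__ void bias_add_relu_fused(const float* x, const float* b, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = x[i] + b[i];                // intermediate lives in a register
        y[i] = fmaxf(t, 0.0f);
    }
}
```

Stitching operators with non-homogeneous parallelism (for example, a reduction feeding an elementwise operator) requires more elaborate schemes, such as exchanging intermediates through shared memory; choosing among such schemes under memory cost and resource constraints is what the paper's just-in-time cost model is for.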
