FusionStitching: Boosting Execution Efficiency of Memory Intensive Computations for DL Workloads

Performance optimization is the art of continuously seeking a harmonious mapping between the application domain and the hardware. Recent years have witnessed a surge of deep learning (DL) applications in industry. Conventional wisdom for optimizing such workloads focuses mainly on compute-intensive ops (GEMM, convolution, etc.). Yet we show in this work that the performance of memory-intensive computations is vital to end-to-end (E2E) performance in practical DL models. We propose \emph{FusionStitching}, an optimization framework capable of fusing memory-intensive \emph{elementwise}, \emph{reduction}, and fine-grained \emph{GEMM/Batched-GEMM} ops, with or without data dependences, into large computation units, and then mapping and transforming them into efficient GPU kernels. We formulate fusion plan optimization as an integer linear programming (ILP) problem and propose a set of empirical heuristics to reduce the combinatorial search space. To map optimized fusion plans to hardware, we propose a technique that composes various groups of computations into a single GPU kernel by fully exploiting on-chip resources such as scratchpad memory and registers. Experimental results on six benchmarks and four industry-scale production models are encouraging. Overall, \emph{FusionStitching} achieves up to 5.7x speedup over the TensorFlow baseline, and 1.25x to 1.85x speedups over the current state of the art, with 1.4x on average (geometric mean).
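
To make the kernel-composition idea concrete, below is a minimal CUDA sketch, not the framework's actual generated code, of stitching a memory-intensive elementwise stage (bias-add plus a GeLU approximation) and a row-sum reduction into one kernel, so the intermediate stays in registers and shared memory instead of taking a round trip through global memory. The kernel name, shapes, and the one-block-per-row mapping are illustrative assumptions.

// Minimal sketch: fuse an elementwise stage with a reduction in one kernel.
// Assumes blockDim.x is a power of two; launch with one block per row and
// blockDim.x * sizeof(float) bytes of dynamic shared memory, e.g.
//   fused_bias_gelu_rowsum<<<rows, 256, 256 * sizeof(float)>>>(x, bias, row_sum, cols);
__global__ void fused_bias_gelu_rowsum(const float* __restrict__ x,
                                       const float* __restrict__ bias,
                                       float* __restrict__ row_sum,
                                       int cols) {
    extern __shared__ float smem[];   // on-chip scratchpad for the reduction
    const int row = blockIdx.x;       // one thread block handles one row
    float partial = 0.0f;

    // Elementwise stage: bias-add + tanh-based GeLU approximation; the result
    // is consumed directly by the reduction without touching global memory.
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
        float v = x[row * cols + c] + bias[c];
        float g = 0.5f * v * (1.0f + tanhf(0.7978845608f * (v + 0.044715f * v * v * v)));
        partial += g;
    }

    // Reduction stage: block-wide tree sum of the per-thread partials.
    smem[threadIdx.x] = partial;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) smem[threadIdx.x] += smem[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) row_sum[row] = smem[0];
}

An unfused pipeline would write the elementwise output to global memory and read it back for the reduction; stitching the two stages removes that traffic, which is exactly the memory-bandwidth saving the framework targets.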
