DNNFusion: accelerating deep neural networks execution with advanced operator fusion

Deep Neural Networks (DNNs) have emerged as the core enabler of many major applications on mobile devices. To achieve high accuracy, DNN models have become increasingly deep, with hundreds or even thousands of operator layers, leading to high memory and computational requirements for inference. Operator fusion (or kernel/layer fusion) is a key optimization in many state-of-the-art DNN execution frameworks, such as TensorFlow, TVM, and MNN, that aims to improve the efficiency of DNN inference. However, these frameworks usually adopt fusion approaches based on certain patterns that are too restrictive to cover the diversity of operators and layer connections, especially those seen in many extremely deep models. Polyhedral-based loop fusion techniques, on the other hand, work on a low-level view of the computation without operator-level information, and can also miss potential fusion opportunities. To address this challenge, this paper proposes a novel and extensive loop fusion framework called DNNFusion. The basic idea of this work is to operate at an operator-level view of DNNs while expanding fusion opportunities through a classification of both individual operators and their combinations. In addition, DNNFusion includes 1) a novel mathematical-property-based graph rewriting framework to reduce evaluation costs and facilitate subsequent operator fusion, 2) an integrated fusion plan generation that leverages high-level analysis and accurate, lightweight profiling, and 3) additional optimizations during fusion code generation. DNNFusion is extensively evaluated on 15 DNN models with varied types of tasks, model sizes, and layer counts. The evaluation results demonstrate that DNNFusion finds up to 8.8× more fusion opportunities and outperforms four state-of-the-art DNN execution frameworks with speedups of up to 9.3×. The memory requirement reduction and speedups enable the execution of many of the target models on mobile devices and even make them part of a real-time application.
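
To make the notion of operator fusion concrete, the following is a minimal, illustrative sketch (not DNNFusion's implementation) of fusing element-wise operators into the operator that produces their input: the bias add and ReLU are applied while each matmul output row is still in registers/cache, so no intermediate tensors are materialized. The function names and tiling granularity are invented for illustration.

```python
# Toy example of operator fusion; not DNNFusion's actual code generation.
import numpy as np

def unfused(x, w, bias):
    # Three separate operators; each one writes a full intermediate tensor.
    y = x @ w                   # matmul
    y = y + bias                # element-wise bias add
    return np.maximum(y, 0.0)   # element-wise ReLU

def fused(x, w, bias):
    # One "fused kernel": element-wise operators are applied per output row,
    # immediately after that row is computed, avoiding intermediate buffers.
    out = np.empty((x.shape[0], w.shape[1]), dtype=x.dtype)
    for i in range(x.shape[0]):
        row = x[i] @ w                      # compute one output row
        out[i] = np.maximum(row + bias, 0)  # fused bias add + ReLU
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 16)).astype(np.float32)
    w = rng.standard_normal((16, 4)).astype(np.float32)
    b = rng.standard_normal(4).astype(np.float32)
    assert np.allclose(unfused(x, w, b), fused(x, w, b), atol=1e-5)
```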

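The abstract also mentions mathematical-property-based graph rewriting. Below is a minimal sketch, under assumed notation, of the general idea: applying an algebraic identity (here, distributivity) to an operator graph so that an expensive operator is evaluated once instead of twice, which both reduces cost and yields a simpler graph for subsequent fusion. The tuple-based expression IR and the single rewrite rule are invented for illustration; they are not DNNFusion's actual rewriting framework.

```python
# Toy mathematical-property-based graph rewrite; not DNNFusion's framework.

def rewrite(node):
    """Recursively rewrite an expression tree given as nested tuples.

    Applies one distributivity rule:
        MatMul(W, x) + MatMul(W, y)  ->  MatMul(W, x + y)
    replacing two matrix multiplications with one.
    """
    if not isinstance(node, tuple):
        return node  # leaf: a tensor name
    op, *args = node
    args = [rewrite(a) for a in args]
    if (op == "add" and len(args) == 2
            and all(isinstance(a, tuple) and a[0] == "matmul" for a in args)
            and args[0][1] == args[1][1]):
        w = args[0][1]
        return ("matmul", w, ("add", args[0][2], args[1][2]))
    return (op, *args)

if __name__ == "__main__":
    expr = ("add", ("matmul", "W", "x"), ("matmul", "W", "y"))
    print(rewrite(expr))  # ('matmul', 'W', ('add', 'x', 'y'))
```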