The Deep Learning Compiler: A Comprehensive Survey

The difficulty of deploying various deep learning (DL) models on diverse DL hardware has boosted the research and development of DL compilers in the community. Several DL compilers have been proposed by both industry and academia, such as TensorFlow XLA and TVM. In general, these DL compilers take the DL models described in different DL frameworks as input and generate optimized code for diverse DL hardware as output. However, no existing survey has comprehensively analyzed the unique design architecture of DL compilers. In this article, we perform a comprehensive survey of existing DL compilers by dissecting their commonly adopted design in detail, with an emphasis on the DL-oriented multi-level IRs and the frontend/backend optimizations. We present a detailed analysis of the design of multi-level IRs and illustrate the commonly adopted optimization techniques. Finally, we highlight several insights as potential research directions for DL compilers. This is the first survey article focusing on the design architecture of DL compilers, and we hope it paves the way for future research on DL compilers.
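To make this commonly adopted workflow concrete, the following sketch shows how a DL compiler such as TVM imports a framework-exported model, applies graph-level optimizations, and generates code for a hardware target. It is a minimal, hypothetical example (the ONNX file name, input tensor name, shapes, and the exact TVM API version are assumptions, not details taken from this survey):

import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Hypothetical inputs: the model file, input tensor name, and shape are assumptions.
onnx_model = onnx.load("resnet18.onnx")
shape_dict = {"input": (1, 3, 224, 224)}

# Frontend: import the framework-exported model into the compiler's high-level (graph) IR.
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Graph-level (frontend) optimizations, lowering to the low-level IR, and code
# generation for the chosen hardware target (here: CPU code via LLVM).
target = tvm.target.Target("llvm")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Backend/runtime: execute the compiled module on the target device.
dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
output = module.get_output(0).numpy()

The same split between the frontend (high-level IR and graph optimizations) and the backend (low-level IR, code generation, and runtime) recurs across the DL compilers analyzed in this survey.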
