Lorien: Efficient Deep Learning Workloads Delivery

Modern deep learning systems embrace compilation to automatically generate code for deep learning models, keeping pace with rapidly evolving deep learning operators and newly emerging hardware platforms. The performance of the generated code is guaranteed by auto-tuning frameworks, which usually take a long time to find proper execution schedules for the given operators; this hurts both user experience and time-to-market in model development and deployment. To efficiently deliver a high-performance schedule upon request, in this paper we present Lorien, an open-source infrastructure that tunes operators and orchestrates the tuned schedules in a systematic way. Lorien is designed to be extensible to state-of-the-art auto-tuning frameworks and scalable enough to coordinate a large number of compute resources for its tuning tasks with fault tolerance. We leveraged Lorien to extract thousands of operator-level tuning tasks from 29 widely used models in the GluonCV model zoo [13] and tuned them on x86 CPUs, ARM CPUs, and NVIDIA GPUs to construct a database for queries. In addition, to deliver reasonably high-performance schedules for unseen workloads within seconds or minutes, Lorien integrates an AutoML solution that trains a performance cost model on the collected large-scale datasets. Our evaluation shows that the AutoML-based solution is accurate enough to enable zero-shot tuning, which neither fine-tunes the cost model during tuning nor performs on-device measurements, yet finds decent schedules with at least 10x less tuning time than existing auto-tuning frameworks.
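
To make the zero-shot tuning workflow concrete, below is a minimal Python sketch of the idea: rank candidate schedules with a cost model trained offline and return the best one without any on-device measurement. The knob names, featurization, and helper functions are illustrative assumptions for this sketch and do not reflect Lorien's actual API.

    # A minimal sketch of zero-shot schedule selection, assuming a cost model
    # that was trained offline on large-scale tuning logs. All names below
    # (featurize, zero_shot_pick, the knob names, toy_predictor) are
    # hypothetical and do not reflect Lorien's actual API.
    from typing import Callable, Dict, List, Sequence

    def featurize(schedule: Dict[str, int]) -> List[float]:
        # Map schedule knobs (e.g., tile sizes, unroll factors) to a feature vector.
        return [float(schedule[k]) for k in sorted(schedule)]

    def zero_shot_pick(
        candidates: Sequence[Dict[str, int]],
        predict_latency: Callable[[List[List[float]]], List[float]],
    ) -> Dict[str, int]:
        # Return the candidate with the lowest predicted latency; no on-device
        # measurement and no cost-model fine-tuning happens at this point.
        preds = predict_latency([featurize(c) for c in candidates])
        best = min(range(len(candidates)), key=lambda i: preds[i])
        return candidates[best]

    if __name__ == "__main__":
        candidates = [
            {"tile_x": 8, "tile_y": 8, "unroll": 4},
            {"tile_x": 16, "tile_y": 4, "unroll": 2},
        ]

        def toy_predictor(feature_rows: List[List[float]]) -> List[float]:
            # Stand-in for the AutoML-trained cost model (e.g., a boosted-tree
            # regressor queried for predicted latency).
            return [sum(row) for row in feature_rows]

        print(zero_shot_pick(candidates, toy_predictor))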

[1] Ramakrishna Upadrasta, et al. PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives, 2020, ACM Trans. Archit. Code Optim.

[2] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[3] Shoaib Kamil, et al. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code, 2019, CGO.

[4] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, arXiv.

[5] Song Han, et al. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing, 2020, ACL.

[6] Haichen Shen, et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, 2018, OSDI.

[7] Frédo Durand, et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, 2013, PLDI.

[8] D. Sculley, et al. Google Vizier: A Service for Black-Box Optimization, 2017, KDD.

[9] Song Han, et al. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware, 2018, ICLR.

[10] Tianqi Chen, et al. XGBoost: A Scalable Tree Boosting System, 2016, KDD.

[11] Jonathan Ragan-Kelley, et al. Automatically scheduling Halide image processing pipelines, 2016, ACM Trans. Graph.

[12] Sander Stuijk, et al. Schedule Synthesis for Halide Pipelines on GPUs, 2020, ACM Trans. Archit. Code Optim.

[13] He He, et al. GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing, 2020, J. Mach. Learn. Res.

[14] Frédo Durand, et al. Learning to optimize Halide with tree search and random programs, 2019, ACM Trans. Graph.

[15] Yanqi Zhou, et al. A Learned Performance Model for the Tensor Processing Unit, 2020, arXiv.

[16] Qingquan Song, et al. Auto-Keras: An Efficient Neural Architecture Search System, 2018, KDD.

[17] John Tran, et al. cuDNN: Efficient Primitives for Deep Learning, 2014, arXiv.

[18] Jaana Kekäläinen, et al. Cumulated gain-based evaluation of IR techniques, 2002, TOIS.

[19] Hang Zhang, et al. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data, 2020, arXiv.

[20] Anna Veronika Dorogush, et al. CatBoost: unbiased boosting with categorical features, 2017, NeurIPS.

[21] Richard Veras, et al. Analytical cache modeling and tilesize optimization for tensor contractions, 2019, SC.

[22] S. Winograd. Arithmetic complexity of computations, 1980.

[23] Cody Coleman, et al. MLPerf Inference Benchmark, 2020, ISCA.

[24] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[25] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[26] Yun Liang, et al. FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System, 2020, ASPLOS.

[27] Thierry Moreau, et al. Learning to Optimize Tensor Programs, 2018, NeurIPS.

[28] Julio Delgado, et al. Elastic Machine Learning Algorithms in Amazon SageMaker, 2020, SIGMOD.

[29] Albert Cohen, et al. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions, 2018, arXiv.

[30] Andrey Gulin, et al. Winning The Transfer Learning Track of Yahoo!'s Learning To Rank Challenge with YetiRank, 2010, Yahoo! Learning to Rank Challenge.

[31] Valerio Schiavoni, et al. PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep Learning Clusters, 2020, Middleware.

[32] Mao Yang, et al. OpEvo: An Evolutionary Method for Tensor Operator Optimization, 2021, AAAI.

[33] Jia Deng, et al. Large scale visual recognition, 2012.

[34] Cody Hao Yu, et al. Ansor: Generating High-Performance Tensor Programs for Deep Learning, 2020, OSDI.

[35] Yida Wang, et al. Optimizing CNN Model Inference on CPUs, 2018, USENIX Annual Technical Conference.

[36] Karima Benatchba, et al. A Deep Learning Based Cost Model for Automatic Code Optimization, 2021, MLSys.