Lorien: Efficient Deep Learning Workloads Delivery

Modern deep learning systems embrace compilation to automatically generate code for deep learning models, keeping pace with rapidly evolving deep learning operators and newly emerging hardware platforms. The performance of the generated code is guaranteed by auto-tuning frameworks, which usually take a long time to find proper execution schedules for the given operators; this hurts both user experience and time-to-market in model development and deployment. To efficiently deliver a high-performance schedule upon request, in this paper we present Lorien, an open-source infrastructure that tunes operators and orchestrates the tuned schedules in a systematic way. Lorien is designed to be extensible to state-of-the-art auto-tuning frameworks and scalable enough to coordinate a large number of compute resources for its tuning tasks with fault tolerance. We leveraged Lorien to extract thousands of operator-level tuning tasks from 29 widely used models in the GluonCV model zoo [13] and tuned them on x86 CPUs, ARM CPUs, and NVIDIA GPUs to construct a database for queries. In addition, to deliver reasonably high-performance schedules for unseen workloads within seconds or minutes, Lorien integrates an AutoML solution that trains a performance cost model on the collected large-scale datasets. Our evaluation shows that the AutoML-based solution is accurate enough to enable zero-shot tuning, which neither fine-tunes the cost model during tuning nor performs on-device measurements, yet finds decent schedules with at least 10x less tuning time than existing auto-tuning frameworks.
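
To make the zero-shot tuning workflow concrete, below is a minimal Python sketch of the idea: rank candidate schedules with a cost model trained offline and return the best one without any on-device measurement. The knob names, featurization, and helper functions are illustrative assumptions for this sketch and do not reflect Lorien's actual API.

    # A minimal sketch of zero-shot schedule selection, assuming a cost model
    # that was trained offline on large-scale tuning logs. All names below
    # (featurize, zero_shot_pick, the knob names, toy_predictor) are
    # hypothetical and do not reflect Lorien's actual API.
    from typing import Callable, Dict, List, Sequence

    def featurize(schedule: Dict[str, int]) -> List[float]:
        # Map schedule knobs (e.g., tile sizes, unroll factors) to a feature vector.
        return [float(schedule[k]) for k in sorted(schedule)]

    def zero_shot_pick(
        candidates: Sequence[Dict[str, int]],
        predict_latency: Callable[[List[List[float]]], List[float]],
    ) -> Dict[str, int]:
        # Return the candidate with the lowest predicted latency; no on-device
        # measurement and no cost-model fine-tuning happens at this point.
        preds = predict_latency([featurize(c) for c in candidates])
        best = min(range(len(candidates)), key=lambda i: preds[i])
        return candidates[best]

    if __name__ == "__main__":
        candidates = [
            {"tile_x": 8, "tile_y": 8, "unroll": 4},
            {"tile_x": 16, "tile_y": 4, "unroll": 2},
        ]

        def toy_predictor(feature_rows: List[List[float]]) -> List[float]:
            # Stand-in for the AutoML-trained cost model (e.g., a boosted-tree
            # regressor queried for predicted latency).
            return [sum(row) for row in feature_rows]

        print(zero_shot_pick(candidates, toy_predictor))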

[1] Ramakrishna Upadrasta, et al. PolyDL: Polyhedral Optimizations for Creation of High Performance DL primitives, 2020, ACM Trans. Archit. Code Optim.

[2] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[3] Shoaib Kamil, et al. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code, 2019, CGO.

[4] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, arXiv.

[5] Song Han, et al. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing, 2020, ACL.

[6] Haichen Shen, et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, 2018, OSDI.

[7] Frédo Durand, et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, 2013, PLDI.

[8] D. Sculley, et al. Google Vizier: A Service for Black-Box Optimization, 2017, KDD.

[9] Song Han, et al. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware, 2018, ICLR.

[10] Tianqi Chen, et al. XGBoost: A Scalable Tree Boosting System, 2016, KDD.

[11] Jonathan Ragan-Kelley, et al. Automatically scheduling Halide image processing pipelines, 2016, ACM Trans. Graph.

[12] Sander Stuijk, et al. Schedule Synthesis for Halide Pipelines on GPUs, 2020, ACM Trans. Archit. Code Optim.

[13] He He, et al. GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing, 2020, J. Mach. Learn. Res.

[14] Frédo Durand, et al. Learning to optimize Halide with tree search and random programs, 2019, ACM Trans. Graph.

[15] Yanqi Zhou, et al. A Learned Performance Model for the Tensor Processing Unit, 2020, arXiv.

[16] Qingquan Song, et al. Auto-Keras: An Efficient Neural Architecture Search System, 2018, KDD.

[17] John Tran, et al. cuDNN: Efficient Primitives for Deep Learning, 2014, arXiv.

[18] Jaana Kekäläinen, et al. Cumulated gain-based evaluation of IR techniques, 2002, TOIS.

[19] Hang Zhang, et al. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data, 2020, arXiv.

[20] Anna Veronika Dorogush, et al. CatBoost: unbiased boosting with categorical features, 2017, NeurIPS.

[21] Richard Veras, et al. Analytical cache modeling and tilesize optimization for tensor contractions, 2019, SC.

[22] S. Winograd. Arithmetic complexity of computations, 1980.

[23] Cody Coleman, et al. MLPerf Inference Benchmark, 2020, ISCA.

[24] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[25] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[26] Yun Liang, et al. FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System, 2020, ASPLOS.

[27] Thierry Moreau, et al. Learning to Optimize Tensor Programs, 2018, NeurIPS.

[28] Julio Delgado, et al. Elastic Machine Learning Algorithms in Amazon SageMaker, 2020, SIGMOD.

[29] Albert Cohen, et al. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions, 2018, arXiv.

[30] Andrey Gulin, et al. Winning The Transfer Learning Track of Yahoo!'s Learning To Rank Challenge with YetiRank, 2010, Yahoo! Learning to Rank Challenge.

[31] Valerio Schiavoni, et al. PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep Learning Clusters, 2020, Middleware.

[32] Mao Yang, et al. OpEvo: An Evolutionary Method for Tensor Operator Optimization, 2021, AAAI.

[33] Jia Deng, et al. Large scale visual recognition, 2012.

[34] Cody Hao Yu, et al. Ansor: Generating High-Performance Tensor Programs for Deep Learning, 2020, OSDI.

[35] Yida Wang, et al. Optimizing CNN Model Inference on CPUs, 2018, USENIX Annual Technical Conference.

[36] Karima Benatchba, et al. A Deep Learning Based Cost Model for Automatic Code Optimization, 2021, MLSys.