A Learned Performance Model for the Tensor Processing Unit

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration of a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for the Tensor Processing Unit (TPU). We train a neural network over kernel-level sub-graphs from the corpus and find that the learned model is competitive with a heavily-optimized analytical cost model used in the production XLA compiler.

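The abstract summarizes the approach at a high level. As a rough illustration only, and not the authors' actual architecture, the sketch below shows how a graph-network-style performance model could map a kernel-level op graph to a predicted runtime: each op node carries a feature vector (e.g., opcode and tensor-shape features), a mean-aggregation layer mixes neighbor information, and a pooled graph embedding is regressed to a scalar cost. All dimensions, feature choices, and parameter names here are illustrative assumptions.

```python
# Minimal sketch of a learned cost model over a kernel sub-graph.
# Not the paper's model; layer sizes, features, and readout are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def graph_layer(node_feats, adj, w_self, w_neigh):
    # Mean-aggregate neighbor features, then combine with each node's own features.
    deg = adj.sum(axis=1, keepdims=True).clip(min=1.0)
    neigh_mean = adj @ node_feats / deg
    return relu(node_feats @ w_self + neigh_mean @ w_neigh)

def predict_runtime(node_feats, adj, params):
    h = node_feats
    for w_self, w_neigh in params["layers"]:
        h = graph_layer(h, adj, w_self, w_neigh)
    graph_embedding = h.mean(axis=0)                    # permutation-invariant pooling
    return float(graph_embedding @ params["readout"])   # predicted cost (e.g., cycles)

# Toy kernel sub-graph: 4 ops with 8-dimensional node features and a dense
# adjacency matrix encoding producer -> consumer edges.
num_nodes, feat_dim, hidden = 4, 8, 16
node_feats = rng.normal(size=(num_nodes, feat_dim))
adj = np.array([[0, 1, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 0, 1],
                [0, 0, 0, 0]], dtype=float)

params = {
    "layers": [
        (rng.normal(size=(feat_dim, hidden)), rng.normal(size=(feat_dim, hidden))),
        (rng.normal(size=(hidden, hidden)), rng.normal(size=(hidden, hidden))),
    ],
    "readout": rng.normal(size=hidden),
}

print(predict_runtime(node_feats, adj, params))
```

In a real training setup, the readout would be fit against measured kernel runtimes (or trained with a ranking loss when only relative ordering matters); the sketch above only shows the forward pass over one sub-graph.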