RL-Scope: Cross-Stack Profiling for Deep Reinforcement Learning Workloads

In recent years, deep reinforcement learning (RL) has demonstrated groundbreaking results in robotics, datacenter management, and many other applications. Despite its increasing popularity, there has been little work on understanding system-level bottlenecks in RL workloads. Instead, the common implicit assumption is that RL workloads are similar to classic supervised learning (SL) workloads. Our analysis contradicts this assumption: operations considered GPU-heavy in SL spend at most 12.9% of their time GPU-bound in RL workloads, with the remainder CPU-bound across different layers of the software stack, running high-level language code and non-compute code such as ML backend and CUDA API calls. To explain where training time is spent in RL workloads, we propose RL-Scope: an accurate cross-stack profiler that supports multiple ML backends and simulators. In contrast to existing profilers, which are limited to a single layer of the software and hardware stack, RL-Scope collects profiling information across the entire stack and scopes it to high-level operations, giving developers and researchers a complete picture of RL training time. We demonstrate RL-Scope’s utility through several in-depth case studies. First, we compare RL frameworks to quantify the effects of fundamental design choices behind ML backends. For example, we use RL-Scope to measure and explain a 2.3× difference in runtime between equivalent PyTorch and TensorFlow algorithm implementations, and to identify a bottleneck rooted in overly abstracted algorithm implementations. Next, we survey how training bottlenecks change as we consider different simulators and RL algorithms, and show that on-policy algorithms are at least 3.5× more simulation-bound than off-policy algorithms. Finally, we profile a scale-up workload and demonstrate that GPU utilization metrics reported by commonly used tools dramatically inflate GPU usage, whereas RL-Scope reports true GPU-bound time. RL-Scope is an open-source tool available at https://github.com/UofT-EcoSystem/rlscope.
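To make the CPU-bound versus GPU-bound distinction concrete, the sketch below hand-rolls a breakdown of a single RL training iteration in PyTorch. It is not the RL-Scope API: names such as `env`, `policy`, `policy.act`, `policy.loss`, and `optimizer` are hypothetical, and the CUDA-event span only approximates GPU-bound time, whereas RL-Scope traces the entire stack and scopes the results to high-level operations.

```python
# Minimal sketch (assumed, hypothetical objects): separate CPU-side time
# (Python, simulator, ML-backend bookkeeping, CUDA API calls) from the
# GPU-side span of one training iteration.
import time
import torch

def profile_iteration(env, policy, optimizer, obs):
    timings = {}

    # Simulation step: runs entirely on the CPU inside the simulator.
    t0 = time.perf_counter()
    action = policy.act(obs)                  # small inference; mostly CPU overhead
    obs, reward, done, info = env.step(action)
    timings["simulation_s"] = time.perf_counter() - t0

    # Backpropagation: compare wall-clock time against the GPU-side span.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    t0 = time.perf_counter()
    start.record()
    loss = policy.loss(obs, reward)           # hypothetical loss helper
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    end.record()
    torch.cuda.synchronize()                  # wait for all queued kernels to finish
    timings["backward_wall_s"] = time.perf_counter() - t0
    # elapsed_time() is in milliseconds; the event span includes GPU idle gaps
    # while the CPU is still launching kernels, so it is only an upper bound
    # on true GPU-bound time (which requires kernel-level tracing).
    timings["backward_gpu_span_s"] = start.elapsed_time(end) / 1e3

    # Everything outside the GPU span is CPU-side work: Python, ML-backend
    # bookkeeping, and CUDA API call overhead.
    timings["backward_cpu_s"] = max(
        0.0, timings["backward_wall_s"] - timings["backward_gpu_span_s"]
    )
    return timings, obs, done
```

Timers like these must be threaded through every operation by hand and still cannot attribute time to individual CUDA API calls or simulator internals; closing that gap, across backends and simulators, is what a cross-stack profiler such as RL-Scope is designed to do.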
