Deep Reinforcement Learning at the Edge of the Statistical Precipice

Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field’s confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied by an open-source library, rliable, to prevent unreliable results from stagnating the field.
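As a rough illustration of the statistical tools the abstract refers to, the sketch below computes an interquartile mean (IQM) of normalized scores, a stratified-bootstrap confidence interval around it, and a simple performance profile. This is a minimal NumPy/SciPy approximation, not the rliable implementation itself; the `scores` array, its `(num_runs, num_tasks)` layout, and all helper names are assumptions made for the example.

```python
# Minimal sketch of few-run aggregate metrics: IQM, a stratified-bootstrap
# confidence interval, and a performance profile. Assumes `scores` holds
# normalized scores with shape (num_runs, num_tasks), one row per training run.

import numpy as np
from scipy import stats


def iqm(scores: np.ndarray) -> float:
    """Interquartile mean: mean after discarding the lowest and highest 25% of scores."""
    return float(stats.trim_mean(scores.reshape(-1), proportiontocut=0.25))


def stratified_bootstrap_ci(scores, statistic=iqm, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling runs with replacement within each task."""
    rng = np.random.default_rng(seed)
    num_runs, num_tasks = scores.shape
    estimates = np.empty(reps)
    for b in range(reps):
        # Stratified resampling: each task (stratum) gets its own resampled runs.
        idx = rng.integers(num_runs, size=(num_runs, num_tasks))
        estimates[b] = statistic(scores[idx, np.arange(num_tasks)])
    lower = np.percentile(estimates, 100 * alpha / 2)
    upper = np.percentile(estimates, 100 * (1 - alpha / 2))
    return lower, upper


def performance_profile(scores, taus):
    """Fraction of all run-task scores exceeding each threshold tau."""
    flat = scores.reshape(-1)
    return np.array([(flat > tau).mean() for tau in taus])


if __name__ == "__main__":
    # Placeholder example: 5 runs on 26 tasks (the Atari 100k setting),
    # with randomly generated scores standing in for real results.
    scores = np.random.default_rng(1).gamma(shape=2.0, scale=0.25, size=(5, 26))
    point = iqm(scores)
    low, high = stratified_bootstrap_ci(scores)
    print(f"IQM = {point:.3f}, 95% bootstrap CI = [{low:.3f}, {high:.3f}]")
    print(performance_profile(scores, taus=np.linspace(0.0, 1.0, 5)))
```

Reporting the interval rather than the point estimate makes the few-run uncertainty explicit: two algorithms whose IQM intervals overlap substantially should not be ranked on point estimates alone.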
