Evaluating the Performance of Reinforcement Learning Algorithms

Performance evaluations are critical for quantifying algorithmic advances in reinforcement learning. Recent reproducibility analyses have shown that reported performance results are often inconsistent and difficult to replicate. In this work, we argue that this inconsistency of reported performance stems from the use of flawed evaluation metrics. As a step toward ensuring that reported results are consistent, we propose a new comprehensive evaluation methodology for reinforcement learning algorithms that produces reliable measurements of performance both on a single environment and when aggregated across environments. We demonstrate this methodology by evaluating a broad class of reinforcement learning algorithms on standard benchmark tasks.
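
The abstract names two levels of evaluation: reliable performance measurement on a single environment, and aggregation across environments. As a minimal sketch of what those two levels can look like in practice, the Python below computes a bootstrap confidence interval on per-environment mean performance and then a cross-environment summary via per-environment min-max normalization of mean scores. This is not the paper's proposed methodology, only a generic illustration; the function names (`bootstrap_ci`, `aggregate`) and the `results[env][alg]` data layout are our own assumptions.

```python
import numpy as np

def bootstrap_ci(returns, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval on the mean performance
    of one algorithm on one environment, given one score per independent
    trial (e.g., per random seed)."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    # Resample trials with replacement and take the mean of each resample.
    boot_means = rng.choice(returns, size=(n_boot, returns.size)).mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2.0, 1.0 - alpha / 2.0])
    return returns.mean(), (lo, hi)

def aggregate(results):
    """Summarize each algorithm across environments: min-max normalize the
    mean scores within each environment (so no single reward scale
    dominates the summary), then average the normalized scores.
    `results[env][alg]` is an array of per-trial returns; this sketch
    assumes every algorithm is evaluated on every environment."""
    algs = list(next(iter(results.values())))
    normalized = {a: [] for a in algs}
    for env_scores in results.values():
        means = {a: float(np.mean(env_scores[a])) for a in algs}
        lo, hi = min(means.values()), max(means.values())
        for a in algs:
            normalized[a].append((means[a] - lo) / (hi - lo) if hi > lo else 0.5)
    return {a: float(np.mean(v)) for a, v in normalized.items()}
```

Note that point estimates alone (the left element returned by `bootstrap_ci`) are exactly the kind of metric the abstract cautions against; reporting the interval, or a full score distribution, is what makes comparisons across papers reproducible.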
