MEASURING THE RELIABILITY OF REINFORCEMENT LEARNING ALGORITHMS

Lack of reliability is a well-known issue for reinforcement learning (RL) algorithms. The problem has gained increasing attention in recent years, and efforts to improve reliability have grown substantially. To aid RL researchers and production users in evaluating and improving reliability, we propose a set of metrics that quantitatively measure different aspects of reliability. In this work, we focus on variability and risk, both during training and after learning (on a fixed policy). The metrics are designed to be general-purpose, and we pair them with complementary statistical tests that enable rigorous comparisons on these metrics. We first describe the desired properties of the metrics and their design, the aspects of reliability that they measure, and their applicability to different scenarios. We then describe the statistical tests and make additional practical recommendations for reporting results. The metrics and accompanying statistical tools have been released as an open-source library. Finally, we apply the metrics to a set of common RL algorithms and environments, compare the algorithms, and analyze the results.
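To make "variability and risk, during training and after learning" concrete, the sketch below illustrates two metrics in this spirit: interquartile range (IQR) of performance across runs as a measure of variability during training, and conditional value at risk (CVaR) of final performance as a measure of risk across runs after learning. This is a minimal illustration, not the paper's open-source library; the array shape, function names, and the 5% CVaR level are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch (not the paper's library). `curves` is assumed to be an
# array of shape (num_runs, num_evals) holding evaluation returns per training run.

def dispersion_across_runs(curves):
    """Variability during training: IQR of performance across runs
    at each evaluation point."""
    q75, q25 = np.percentile(curves, [75, 25], axis=0)
    return q75 - q25  # shape: (num_evals,)

def cvar_of_final_performance(curves, alpha=0.05):
    """Risk across runs after learning: mean return over the worst
    alpha-fraction of runs (lower values indicate higher risk)."""
    final = np.sort(curves[:, -1])
    k = max(1, int(np.ceil(alpha * len(final))))
    return final[:k].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic learning curves: 30 runs, 100 evaluation points.
    curves = np.cumsum(rng.normal(1.0, 5.0, size=(30, 100)), axis=1)
    print("median dispersion across runs (IQR):",
          np.median(dispersion_across_runs(curves)))
    print("CVaR (5%) of final performance:",
          cvar_of_final_performance(curves, 0.05))
```

In practice, such metrics would be computed on evaluation returns logged at fixed intervals across many independent training runs, and the accompanying statistical tests would be used to decide whether differences between algorithms are significant.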
