Transition-Based Discount Factor for Model-Free Algorithms in Reinforcement Learning

Reinforcement learning (RL) enables an agent to learn control policies for achieving its long-term goals. A key parameter of RL algorithms is the discount factor, which scales down future costs in a state's current value estimate. This study introduces and analyses a transition-based discount factor in two model-free reinforcement learning algorithms, Q-learning and SARSA, and proves their convergence for finite state and action spaces using the theory of stochastic approximation. Making the discount depend on the transition induces asymmetric discounting that favours some transitions over others, which (1) yields faster convergence than the constant-discount-factor variants of these algorithms, as demonstrated by experiments on the Taxi and MountainCar environments, and (2) gives better control over whether the RL agent learns a risk-averse or a risk-taking policy, as demonstrated in a Cliff Walking experiment.
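As a rough illustration of the idea (a minimal sketch, not the authors' exact formulation), the snippet below replaces the constant discount gamma in the tabular Q-learning and SARSA updates with a function evaluated on each observed transition. The function name gamma_fn, its (s, a, s') signature, and the placeholder constant it returns are assumptions made for illustration only.

```python
import numpy as np

def transition_discount(s, a, s_next):
    """Hypothetical transition-based discount: a function of the
    (s, a, s') transition rather than a single constant gamma.
    Returns a constant here purely as a placeholder."""
    return 0.95

def q_learning_update(Q, s, a, r, s_next, alpha, gamma_fn=transition_discount):
    """One tabular Q-learning step where the discount applied to the
    bootstrapped term depends on the observed transition."""
    target = r + gamma_fn(s, a, s_next) * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma_fn=transition_discount):
    """One tabular SARSA step with the same transition-based discount."""
    target = r + gamma_fn(s, a, s_next) * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```

Under this reading, the asymmetry comes from gamma_fn: returning a smaller discount for transitions judged risky (for example, steps adjacent to the cliff in Cliff Walking) would down-weight their long-term value and steer the learned policy towards risk-averse behaviour, while the reverse choice would favour risk-taking.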
