Temporal Regularization in Markov Decision Process

Several applications of reinforcement learning suffer from instability due to high variance. The problem is especially prevalent in high-dimensional domains. Regularization is a commonly used technique in machine learning to reduce variance, at the cost of introducing some bias. Most existing regularization techniques focus on spatial (perceptual) regularization. Yet in reinforcement learning, due to the nature of the Bellman equation, there is also an opportunity to exploit temporal regularization based on smoothness in value estimates over trajectories. This paper explores a class of methods for temporal regularization. We formally characterize the bias induced by this technique using Markov chain concepts. We illustrate the various characteristics of temporal regularization via a sequence of simple discrete and continuous MDPs, and show that the technique provides improvement even in high-dimensional Atari games.
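
The abstract does not state an update rule, so the following is only a minimal sketch of one way smoothness over a trajectory could enter a value update. It assumes a tabular TD(0) learner whose bootstrap target mixes the next state's value with the value of the previously visited state. The function name `td0_temporal_reg`, the mixing weight `beta`, and the trajectory format are hypothetical choices made for illustration and are not taken from the paper.

```python
def td0_temporal_reg(episodes, n_states, beta=0.2, gamma=0.9, alpha=0.1):
    """Tabular TD(0) with a bootstrap target smoothed toward the previous state.

    episodes: list of trajectories, each a list of (state, reward, next_state).
    beta:     hypothetical mixing weight; beta = 0 recovers plain TD(0).
    gamma:    discount factor.
    alpha:    step size.
    """
    v = [0.0] * n_states
    for episode in episodes:
        prev_state = None  # no previous state at the start of a trajectory
        for state, reward, next_state in episode:
            bootstrap = v[next_state]
            if prev_state is not None:
                # Assumed form of temporal smoothing: blend the usual
                # next-state estimate with the previous state's estimate.
                bootstrap = (1.0 - beta) * v[next_state] + beta * v[prev_state]
            v[state] += alpha * (reward + gamma * bootstrap - v[state])
            prev_state = state
    return v


# Usage: one short trajectory over three states (state 2 is terminal).
trajectory = [(0, 1.0, 1), (1, 0.0, 2)]
plain = td0_temporal_reg([trajectory], n_states=3, beta=0.0)
smoothed = td0_temporal_reg([trajectory], n_states=3, beta=0.5)
```

In this sketch a larger `beta` pulls each target toward the estimate of the state visited just before, which trades variance for bias in the way the abstract describes; setting `beta` to zero leaves the ordinary TD(0) update unchanged.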
