A unified view of entropy-regularized Markov decision processes

We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs). Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions. Our key result is showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus to argue about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of Schulman et al. (2015) actually converges to the optimal policy, while the entropy-regularized policy gradient methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally, we illustrate empirically the effects of using various regularization techniques on learning performance in a simple reinforcement learning setup.
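To make the construction concrete, here is a minimal sketch of the kind of formulation the abstract describes; it is written in our own notation rather than taken verbatim from the paper. The stationary state-action distribution μ, transition kernel P, reward function r, regularization parameter η, average reward ρ, and value function V are all symbols introduced here as assumptions. The linear program over stationary distributions is regularized by the conditional entropy of the action given the state under μ:

\begin{align*}
  \max_{\mu \ge 0}\;& \sum_{x,a} \mu(x,a)\, r(x,a)
    \;-\; \frac{1}{\eta} \sum_{x,a} \mu(x,a) \log \frac{\mu(x,a)}{\sum_{a'} \mu(x,a')} \\
  \text{s.t.}\;& \sum_{a} \mu(x',a) \;=\; \sum_{x,a} P(x' \mid x, a)\, \mu(x,a) \quad \forall x',
    \qquad \sum_{x,a} \mu(x,a) = 1.
\end{align*}

Under this choice of regularizer, the dual optimality conditions take a "softened" Bellman form, with the hard maximum over actions replaced by a log-sum-exp:

\[
  \rho + V(x) \;=\; \frac{1}{\eta} \log \sum_{a}
    \exp\!\Bigl( \eta \bigl( r(x,a) + \sum_{x'} P(x' \mid x, a)\, V(x') \bigr) \Bigr) \quad \forall x,
\]

which is the sense in which the dual problem resembles the Bellman optimality equations; as η grows, the log-sum-exp approaches the usual maximum over actions.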

[1] Apostolos Burnetas et al. Optimal Adaptive Policies for Markov Decision Processes, 1997, Math. Oper. Res.

[2] R. Howard et al. Risk-Sensitive Markov Decision Processes, 1972.

[3] J. Andrew Bagnell et al. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy, 2010.

[4] Hilbert J. Kappen et al. Dynamic policy programming, 2010, J. Mach. Learn. Res.

[5] Marc Teboulle et al. Mirror descent and nonlinear projected subgradient methods for convex optimization, 2003, Oper. Res. Lett.

[6] Vicenç Gómez et al. Fast rates for online learning in Linearly Solvable Markov Decision Processes, 2017, COLT.

[7] Manfred K. Warmuth et al. The weighted majority algorithm, 1989, 30th Annual Symposium on Foundations of Computer Science.

[8] Kavosh Asadi et al. A New Softmax Operator for Reinforcement Learning, 2016, ArXiv.

[9] Stefan Schaal et al. Path integral control and bounded rationality, 2011, IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL).

[10] Matthieu Geist et al. Approximate Modified Policy Iteration, 2012, ICML.

[11] Jing Peng et al. Function Optimization using Connectionist Reinforcement Learning Algorithms, 1991.

[12] Martin L. Puterman et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[13] R. Rockafellar et al. Monotone Operators and the Proximal Point Algorithm, 1976.

[14] Csaba Szepesvári et al. Online Markov Decision Processes Under Bandit Feedback, 2010, IEEE Transactions on Automatic Control.

[15] Xi-Ren Cao et al. Stochastic learning and optimization - A sensitivity-based approach, 2007, Annual Reviews in Control.

[16] Sanjeev Arora et al. The Multiplicative Weights Update Method: a Meta-Algorithm and Applications, 2012, Theory Comput.

[17] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization, 1983.

[18] Sergey Levine et al. Guided Policy Search via Approximate Mirror Descent, 2016, NIPS.

[19] Yoav Freund et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[20] Yasemin Altun et al. Relative Entropy Policy Search, 2010.

[21] Roy Fox et al. Taming the Noise in Reinforcement Learning via Soft Updates, 2015, UAI.

[22] Vladimir Vovk et al. Aggregating strategies, 1990, COLT.

[23] Csaba Szepesvári et al. Algorithms for Reinforcement Learning, 2010, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[24] John Schulman et al. Trust Region Policy Optimization, 2015, ICML.

[25] Csaba Szepesvári et al. A Generalized Reinforcement-Learning Model: Convergence and Applications, 1996, ICML.

[26] R. A. Howard. Dynamic Programming and Markov Processes, 1960.

[27] Marek Petrik et al. An Approximate Solution Method for Large Risk-Averse Markov Decision Processes, 2012, UAI.

[28] Daniel Hernández-Hernández et al. Risk Sensitive Markov Decision Processes, 1997.

[29] Vladimir N. Vapnik et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[30] Gábor Lugosi et al. Prediction, Learning, and Games, 2006.

[31] Manfred K. Warmuth et al. Relative Loss Bounds for Multidimensional Regression Problems, 1997, Machine Learning.

[32] Andrzej Ruszczynski et al. Risk-averse dynamic programming for Markov decision processes, 2010, Math. Program.

[33] Yishay Mansour et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[34] Lin Xiao et al. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization, 2009, J. Mach. Learn. Res.

[35] Andrew McCallum et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[36] Anind K. Dey et al. Modeling Interaction via the Principle of Maximum Causal Entropy, 2010, ICML.

[37] András György et al. Online Learning in Markov Decision Processes with Changing Cost Sequences, 2014, ICML.

[38] Koray Kavukcuoglu et al. PGQ: Combining policy gradient and Q-learning, 2016, ArXiv.

[39] Doina Precup et al. A Convergent Form of Approximate Policy Iteration, 2002, NIPS.

[40] Volodymyr Mnih et al. Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[41] John Langford et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.

[42] Yishay Mansour et al. Online Markov Decision Processes, 2009, Math. Oper. Res.

[43] H. Brendan McMahan et al. A Survey of Algorithms and Analysis for Adaptive Online Learning, 2014, J. Mach. Learn. Res.

[44] Ambuj Tewari et al. Composite objective mirror descent, 2010, COLT.

[45] J. W. Nieuwenhuis. Book review of D. P. Bertsekas (ed.), Dynamic Programming and Optimal Control, Volume 2, 1999.

[46] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[47] Gergely Neu et al. Online learning in episodic Markovian decision processes by relative entropy policy search, 2013, NIPS.

[48] Shai Shalev-Shwartz et al. Online Learning and Online Convex Optimization, 2012, Found. Trends Mach. Learn.

[49] M. Puterman et al. Modified Policy Iteration Algorithms for Discounted Markov Decision Problems, 1978.

[50] N. Roy et al. On Stochastic Optimal Control and Reinforcement Learning by Approximate Inference, 2013.

[51] Vicenç Gómez et al. Dynamic Policy Programming with Function Approximation, 2011, AISTATS.

[52] B. Martinet. Perturbation des méthodes d'optimisation. Applications, 1978.

[53] Sean R. Eddy et al. What is dynamic programming?, 2004, Nature Biotechnology.