Variance-constrained actor-critic algorithms for discounted and average reward MDPs

In many sequential decision-making problems, we may want to manage risk by minimizing some measure of variability in rewards, in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be computationally hard. In this paper, we consider both discounted and average-reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms that operate on three timescales: a TD critic on the fastest timescale, a policy-gradient actor on the intermediate timescale, and dual ascent for the Lagrange multipliers on the slowest timescale. In the discounted setting, we point out the difficulty in estimating the gradient of the variance of the return and incorporate simultaneous perturbation approaches to alleviate this. The average-reward setting, on the other hand, allows for an actor update that uses compatible features to estimate the gradient of the variance. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.
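
To make the three-timescale structure concrete, the sketch below shows one iteration of a variance-constrained update for the discounted setting, with an SPSA-style two-sided perturbation standing in for the variance-gradient estimate. This is only an illustrative outline under stated assumptions: the helper callables (simulate_return, td_critic_update), the step-size exponents, and the fixed perturbation size delta are hypothetical choices, not the paper's exact algorithm.

```python
import numpy as np

def spsa_perturbation(dim, delta):
    """Rademacher +/-1 perturbation scaled by delta (a standard SPSA choice)."""
    return delta * (2 * np.random.randint(0, 2, size=dim) - 1)

def three_timescale_step(theta, lam, v, k,
                         simulate_return, td_critic_update,
                         variance_bound, delta=0.1):
    """One iteration of an illustrative variance-constrained actor-critic update.

    theta: policy parameter, lam: Lagrange multiplier, v: critic parameter,
    k: iteration index. simulate_return(theta) is assumed to return
    (mean, variance) estimates of the return under the policy theta;
    td_critic_update is an assumed TD(0)-style critic step.
    """
    # Step-size schedules: critic decays slowest (fastest timescale),
    # actor is intermediate, Lagrange multiplier decays fastest (slowest timescale).
    alpha_c = 1.0 / (k + 1) ** 0.55   # critic
    beta_a  = 1.0 / (k + 1) ** 0.75   # actor
    gamma_l = 1.0 / (k + 1) ** 1.0    # multiplier

    # Critic: TD update of the value estimate under the current policy.
    v = td_critic_update(v, theta, alpha_c)

    # Actor: SPSA estimate of the gradient of the Lagrangian
    #   L(theta, lam) = -J(theta) + lam * (Var(theta) - variance_bound)
    # from two perturbed policy evaluations.
    d = spsa_perturbation(theta.size, delta)
    mean_p, var_p = simulate_return(theta + d)
    mean_m, var_m = simulate_return(theta - d)
    lagr_p = -mean_p + lam * (var_p - variance_bound)
    lagr_m = -mean_m + lam * (var_m - variance_bound)
    grad_est = (lagr_p - lagr_m) / (2.0 * d)   # element-wise: d has entries +/- delta
    theta = theta - beta_a * grad_est          # descend the Lagrangian in theta

    # Multiplier: projected dual ascent on the variance constraint.
    var_est = 0.5 * (var_p + var_m)
    lam = max(0.0, lam + gamma_l * (var_est - variance_bound))
    return theta, lam, v
```

In a full implementation the perturbation size would also decay with the iteration index and the return statistics would be estimated from simulated trajectories; the essential structural point is the ordering of the timescales, with the critic fastest and the multiplier slowest.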
