Risk-Averse Learning by Temporal Difference Methods

We consider reinforcement learning in which performance is evaluated by a dynamic risk measure. We construct a projected risk-averse dynamic programming equation and study its properties. We then propose risk-averse counterparts of the temporal difference methods and prove their convergence with probability one. We also report an empirical study on a complex transportation problem.
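For context, the classical expected-value TD(0) update that the paper builds risk-averse counterparts of takes the form V(s) ← V(s) + α(r + γV(s') − V(s)). The sketch below is a minimal illustration on a hypothetical two-state chain; it shows only the standard method, not the paper's risk-averse variant.

```python
# Classical TD(0) value estimation on a toy deterministic Markov chain:
# state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1).
gamma = 1.0     # no discounting
alpha = 0.1     # constant step size
V = [0.0, 0.0]  # value estimates for states 0 and 1

for episode in range(500):
    # transition 0 -> 1 with reward 0
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])
    # transition 1 -> terminal with reward 1 (terminal value is 0)
    V[1] += alpha * (1.0 + gamma * 0.0 - V[1])
```

With a constant step size on this deterministic chain, both estimates approach the true values V(0) = V(1) = 1; the convergence-with-probability-one results in the paper concern the stochastic setting with risk measures in place of the expectation.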
