Risk-Averse Learning by Temporal Difference Methods

We consider reinforcement learning in which performance is evaluated by a dynamic risk measure. We construct a projected risk-averse dynamic programming equation and study its properties. We then propose risk-averse counterparts of the temporal difference methods and prove their convergence with probability one. We also report an empirical study on a complex transportation problem.
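For context, the classical expected-value TD(0) update that the paper builds risk-averse counterparts of takes the form V(s) ← V(s) + α(r + γV(s') − V(s)). The sketch below is a minimal illustration on a hypothetical two-state chain; it shows only the standard method, not the paper's risk-averse variant.

```python
# Classical TD(0) value estimation on a toy deterministic Markov chain:
# state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1).
gamma = 1.0     # no discounting
alpha = 0.1     # constant step size
V = [0.0, 0.0]  # value estimates for states 0 and 1

for episode in range(500):
    # transition 0 -> 1 with reward 0
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])
    # transition 1 -> terminal with reward 1 (terminal value is 0)
    V[1] += alpha * (1.0 + gamma * 0.0 - V[1])
```

With a constant step size on this deterministic chain, both estimates approach the true values V(0) = V(1) = 1; the convergence-with-probability-one results in the paper concern the stochastic setting with risk measures in place of the expectation.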
