POLICY EVALUATION WITH STOCHASTIC GRADIENT ESTIMATION TECHNIQUES

In this paper, we consider policy evaluation in a finite-horizon setting with continuous state variables. The Bellman equation expresses the value function as a conditional expectation, which can in turn be written as a ratio of two stochastic gradients. Using the finite difference method and the generalized likelihood ratio method, we propose new estimators for policy evaluation and show how the value of any given state can be estimated from sample paths starting at other states.
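
To make the ratio-of-gradients representation concrete, the following is a minimal sketch of the standard identity for a scalar conditioning variable X with density f_X and an integrable performance Y; the notation (X, Y, f_X) is ours and is not taken from the paper:

\[
\mathbb{E}[Y \mid X = x]
= \frac{\frac{d}{dx}\,\mathbb{E}\!\left[Y\,\mathbf{1}\{X \le x\}\right]}
       {\frac{d}{dx}\,\mathbb{E}\!\left[\mathbf{1}\{X \le x\}\right]}
= \frac{\frac{d}{dx}\int_{-\infty}^{x} \mathbb{E}[Y \mid X = u]\, f_X(u)\, du}{f_X(x)},
\qquad f_X(x) > 0,
\]

so the conditional expectation is a ratio of two derivatives of ordinary expectations, each of which can be approximated by finite differences or estimated without bias by a GLR-type derivative estimator.

As a further illustration, here is a hypothetical finite-difference version of such an estimator (a sketch under the assumptions above, not the estimator proposed in the paper): central differences with half-width delta replace the two derivatives, which amounts to averaging Y over sampled states within delta of the target point x0.

```python
import numpy as np

def fd_conditional_expectation(y, x, x0, delta):
    """Finite-difference estimate of E[Y | X = x0] from samples (x_i, y_i).

    The numerator and denominator approximate d/dx E[Y 1{X <= x}] and
    d/dx P(X <= x) at x0 by central differences with half-width delta;
    their ratio estimates the conditional expectation.
    """
    num = ((y * (x <= x0 + delta)).mean() - (y * (x <= x0 - delta)).mean()) / (2 * delta)
    den = ((x <= x0 + delta).mean() - (x <= x0 - delta).mean()) / (2 * delta)
    return num / den

# Toy check: with Y = X**2 + noise, E[Y | X = 0.5] should be close to 0.25.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x**2 + rng.normal(scale=0.1, size=x.size)
print(fd_conditional_expectation(y, x, x0=0.5, delta=0.05))
```

The bias of this estimator shrinks with delta while its variance grows, which is the usual finite-difference trade-off; GLR-type estimators avoid this trade-off by providing unbiased single-run derivative estimates.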
