We consider the approximate solution of stochastic optimal control problems using a neuro-dynamic programming/reinforcement learning methodology. We focus on the computation of a rollout policy, which is obtained by a single policy iteration starting from some known base policy and using some form of exact or approximate policy improvement. We show that, in a stochastic environment, the popular methods for computing rollout policies are particularly sensitive to simulation and approximation error, and we present more robust alternatives, which aim to estimate relative rather than absolute Q-factor and cost-to-go values. In particular, we propose a method, called differential training, that can be used to obtain an approximation to cost-to-go differences rather than cost-to-go values by using standard methods such as TD(λ) and λ-policy iteration. This method is suitable for recursively generating rollout policies in the context of simulation-based policy iteration methods.
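To illustrate the idea of estimating relative rather than absolute Q-factor values, the following is a minimal sketch (not the paper's algorithm) of a rollout action selection that simulates paired trajectories under common random numbers, so that shared simulation noise cancels in the estimated Q-factor differences. The MDP interface (`step`, `base_policy`) and all function names here are hypothetical, introduced only for this example.

```python
import random

def simulate(state, first_action, step, base_policy, horizon, seed):
    """Simulate a trajectory of given horizon: apply `first_action` once,
    then follow the base policy, accumulating stage costs.
    `step(x, u, rng)` is a hypothetical simulator returning (next_state, cost)."""
    rng = random.Random(seed)
    total, x, u = 0.0, state, first_action
    for _ in range(horizon):
        x, c = step(x, u, rng)
        total += c
        u = base_policy(x)
    return total

def rollout_action(state, actions, step, base_policy, horizon, n_samples=100):
    """Pick an action by estimating Q-factor *differences* against a
    reference action. Each sample reuses one seed (common random numbers)
    for all candidate actions, so noise shared across trajectories cancels."""
    ref = actions[0]
    diffs = {a: 0.0 for a in actions}  # estimated Q(x, a) - Q(x, ref)
    for _ in range(n_samples):
        seed = random.random()
        ref_cost = simulate(state, ref, step, base_policy, horizon, seed)
        for a in actions[1:]:
            diffs[a] += simulate(state, a, step, base_policy, horizon, seed) - ref_cost
    # the action with the smallest estimated difference has the lowest cost
    return min(actions, key=lambda a: diffs[a])
```

Because only the differences are estimated, any cost component common to both trajectories of a pair drops out of the estimate, which is the variance-reduction motivation behind working with relative values.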