DIFFERENTIAL TRAINING OF ROLLOUT POLICIES

We consider the approximate solution of stochastic optimal control problems using a neuro-dynamic programming/reinforcement learning methodology. We focus on the computation of a rollout policy, which is obtained by a single policy iteration starting from some known base policy and using some form of exact or approximate policy improvement. We indicate that, in a stochastic environment, the popular methods of computing rollout policies are particularly sensitive to simulation and approximation error, and we present more robust alternatives, which aim to estimate relative rather than absolute Q-factor and cost-to-go values. In particular, we propose a method, called differential training, that can be used to obtain an approximation to cost-to-go differences rather than cost-to-go values by using standard methods such as TD(λ) and λ-policy iteration. This method is suitable for recursively generating rollout policies in the context of simulation-based policy iteration methods.
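To make the distinction between relative and absolute estimates concrete, the following Python sketch selects a rollout control by estimating Q-factor differences with respect to a reference control, using common random numbers so that simulation noise shared by the compared trajectories largely cancels. This is only a minimal illustration of the relative-estimation idea motivating the paper, not the differential training method itself; the simulator interface (`step`, `base_policy`, `controls`) and all parameter names are assumptions introduced for the example.

```python
# Hypothetical sketch: compare controls via estimated Q-factor differences
# rather than independent absolute Q-factor estimates.
import random


def simulate(state, first_control, step, base_policy, horizon, rng):
    """Simulate one trajectory: apply `first_control`, then follow the base policy."""
    total = 0.0
    u = first_control
    for _ in range(horizon):
        state, cost = step(state, u, rng)   # stochastic transition and stage cost
        total += cost
        u = base_policy(state)
    return total


def rollout_action(state, controls, step, base_policy,
                   horizon=50, num_trajectories=100, seed=0):
    """Select the rollout control by estimating Q-factor differences relative
    to a reference control, reusing the same random seed for both trajectories
    of each pair so that common simulation noise largely cancels."""
    reference = controls[0]
    best_control, best_diff = reference, 0.0
    for u in controls[1:]:
        diff_sum = 0.0
        for k in range(num_trajectories):
            common_seed = seed + k                      # common random numbers
            cost_u = simulate(state, u, step, base_policy,
                              horizon, random.Random(common_seed))
            cost_ref = simulate(state, reference, step, base_policy,
                                horizon, random.Random(common_seed))
            diff_sum += cost_u - cost_ref               # sample of Q(i,u) - Q(i,u_ref)
        mean_diff = diff_sum / num_trajectories
        if mean_diff < best_diff:
            best_control, best_diff = u, mean_diff
    return best_control
```

The point of the pairing is that each sampled difference has much lower variance than the difference of two independently estimated Q-factors, which is why estimating relative values is more robust to simulation error in a stochastic environment.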