The Optimal Reward Baseline for Policy-Gradient Reinforcement Learning

Although policy gradient reinforcement learning (PGRL) has good convergence properties, the variance of the policy gradient estimates in existing PGRL algorithms is usually large, which is a significant problem both in theory and in practice. This paper proposes a new policy gradient algorithm with reward baselines, called Istate Grbp. The Istate Grbp algorithm extends the Istate GPOMDP algorithm by introducing reward baselines to reduce the variance of policy gradient estimation. It is proved that adding a reward baseline to Istate GPOMDP does not change the bias of the policy gradient estimate, and the optimal reward baseline, which minimizes the variance, is derived and shown to be the average of the observed rewards. Experimental results on a typical POMDP problem show that the variance of Istate Grbp is much smaller than that of Istate GPOMDP, and that both learning efficiency and convergence speed are improved.
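For illustration only, the sketch below shows the general shape of a GPOMDP-style eligibility-trace gradient estimate with a constant reward baseline, where the baseline is taken as the average of the observed rewards. This is not the paper's implementation: the toy environment, the softmax policy, and all names (run_episode, gradient_estimate, discount) are assumptions introduced here for the example.

# Minimal sketch of a GPOMDP-style policy-gradient estimate with a reward
# baseline. Illustrative only; the toy setup and names are assumptions,
# not the Istate Grbp algorithm itself.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def run_episode(theta, steps=200, discount=0.95):
    """Collect eligibility-trace terms z_t and the observed rewards."""
    z = np.zeros_like(theta)      # eligibility trace
    grad_terms = []               # pairs (z_t, r_{t+1})
    rewards = []
    obs = 0
    for _ in range(steps):
        probs = softmax(theta[obs])
        action = rng.choice(2, p=probs)
        # gradient of log softmax policy w.r.t. theta[obs]
        g = np.zeros_like(theta)
        g[obs] = np.eye(2)[action] - probs
        z = discount * z + g
        reward = 1.0 if action == obs else 0.0  # toy reward signal
        obs = rng.integers(2)                   # toy observation process
        rewards.append(reward)
        grad_terms.append((z.copy(), reward))
    return grad_terms, rewards

def gradient_estimate(grad_terms, baseline=0.0):
    """Average of z_t * (r_{t+1} - b): subtracting a constant baseline b
    leaves the expectation of the estimate unchanged but can shrink its
    variance."""
    return np.mean([z * (r - baseline) for z, r in grad_terms], axis=0)

theta = np.zeros((2, 2))
grad_terms, rewards = run_episode(theta)
g_plain = gradient_estimate(grad_terms)                            # b = 0
g_base = gradient_estimate(grad_terms, baseline=np.mean(rewards))  # b = average reward
print(g_plain, g_base, sep="\n")

Averaging such estimates over many independent episodes and comparing their spread for b = 0 versus b = mean reward would give a rough empirical check of the variance-reduction claim in the abstract.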