An improvement of policy gradient estimation algorithms

In this paper, we discuss the problem of sample-path-based (on-line) performance gradient estimation for Markov systems. Existing on-line performance gradient estimation algorithms generally rely on a standard importance sampling assumption; when this assumption does not hold, the algorithms may produce poor gradient estimates. We show that the assumption can be relaxed, and we propose several algorithms that provide performance gradient estimates for systems that do not satisfy it. Simulation examples illustrate the accuracy of the estimates.
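
As a point of reference (not the paper's method), the sketch below illustrates where the standard importance-sampling assumption enters such sample-path estimators: each transition is reweighted by the likelihood ratio rho = pi_theta(a|s) / pi_b(a|s), which is well-defined only if the behavior policy pi_b assigns positive probability to every action the target policy pi_theta can take. The toy MDP and all names (target_probs, behavior_probs, estimate_gradient, etc.) are illustrative assumptions. For simplicity the sketch estimates only the gradient of the expected immediate reward along the path; a full on-line policy-gradient estimator would also account for the effect of the parameters on the state distribution.

    # Minimal sketch, assuming a softmax target policy and a uniform
    # behavior policy on a small random MDP. Not the paper's algorithm.
    import numpy as np

    rng = np.random.default_rng(0)
    N_STATES, N_ACTIONS = 4, 2
    theta = rng.normal(size=(N_STATES, N_ACTIONS))  # policy parameters

    def target_probs(s):
        """Softmax target policy pi_theta(.|s)."""
        z = np.exp(theta[s] - theta[s].max())
        return z / z.sum()

    def grad_log_target(s, a):
        """Gradient of log pi_theta(a|s) w.r.t. theta (softmax score)."""
        g = np.zeros_like(theta)
        g[s] = -target_probs(s)
        g[s, a] += 1.0
        return g

    def behavior_probs(s):
        """Behavior policy generating the sample path. The standard
        importance-sampling assumption: behavior_probs(s)[a] > 0
        wherever target_probs(s)[a] > 0."""
        return np.full(N_ACTIONS, 1.0 / N_ACTIONS)

    # Illustrative random transition kernel and rewards.
    P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))
    R = rng.normal(size=(N_STATES, N_ACTIONS))

    def estimate_gradient(T=100_000):
        """Single-path estimate of the gradient of the expected
        immediate reward, using per-step likelihood ratios."""
        s, grad = 0, np.zeros_like(theta)
        for _ in range(T):
            a = rng.choice(N_ACTIONS, p=behavior_probs(s))
            rho = target_probs(s)[a] / behavior_probs(s)[a]  # IS ratio
            grad += rho * R[s, a] * grad_log_target(s, a)
            s = rng.choice(N_STATES, p=P[s, a])
        return grad / T

    print(estimate_gradient(10_000))

If behavior_probs assigned zero mass to an action that target_probs supports, the ratio rho would be undefined there and the estimator would silently drop those terms; relaxing this requirement is the situation the paper addresses.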