In the tabular case, when the reward and environment dynamics are known, policy evaluation admits the closed form $\bm{V}_{\bm{\pi}} = (I - \gamma P_{\bm{\pi}})^{-1} \bm{r}_{\bm{\pi}}$, where $P_{\bm{\pi}}$ is the state transition matrix induced by policy ${\bm{\pi}}$ and $\bm{r}_{\bm{\pi}}$ is the expected reward under ${\bm{\pi}}$. The difficulty is that both $P_{\bm{\pi}}$ and $\bm{r}_{\bm{\pi}}$ are entangled with ${\bm{\pi}}$: whenever we update ${\bm{\pi}}$, they change as well. In this paper, we leverage the notation of \cite{wang2007dual} to disentangle ${\bm{\pi}}$ from the environment dynamics, which makes optimization over the policy more straightforward. We show that the policy gradient theorem \cite{sutton2018reinforcement} and TRPO \cite{schulman2015trust} fit into a more general framework, and that this notation has the potential to be extended to model-based reinforcement learning.
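As a concrete illustration of the closed-form evaluation and the disentanglement idea, the minimal NumPy sketch below builds a small random tabular MDP and factors $P_{\bm{\pi}} = \Pi P$ and $\bm{r}_{\bm{\pi}} = \Pi \bm{r}$, where $\Pi$ embeds the policy and $P$, $\bm{r}$ depend only on the environment. The variable names, sizes, and the specific $\Pi$-embedding shown here are illustrative assumptions in the spirit of the disentanglement described above, not the exact notation of \cite{wang2007dual}.

```python
import numpy as np

# Hypothetical small MDP: sizes and discount are arbitrary illustrative choices.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# Environment-only quantities: dynamics P of shape (S*A, S), reward r of shape (S*A,).
P = rng.random((n_states * n_actions, n_states))
P /= P.sum(axis=1, keepdims=True)          # make each row a probability distribution
r = rng.random(n_states * n_actions)

# Policy pi of shape (S, A), embedded into a block-structured matrix Pi of shape (S, S*A)
# so that the policy-dependent quantities factor as P_pi = Pi @ P and r_pi = Pi @ r.
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)
Pi = np.zeros((n_states, n_states * n_actions))
for s in range(n_states):
    Pi[s, s * n_actions:(s + 1) * n_actions] = pi[s]

P_pi = Pi @ P          # state transition matrix under pi
r_pi = Pi @ r          # expected one-step reward under pi

# Closed-form policy evaluation: V_pi = (I - gamma * P_pi)^{-1} r_pi
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(V_pi)
```

Note that updating the policy only changes $\Pi$; the environment matrices $P$ and $\bm{r}$ stay fixed, which is the convenience the factored notation is meant to expose.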
[1] John Schulman et al. Trust Region Policy Optimization. ICML, 2015.
[2] Tao Wang et al. Dual Representations for Dynamic Programming and Reinforcement Learning. IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), 2007.
[3] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[4] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2008.