Paper Information - Policy Gradient Reinforcement Learning with Environmental Dynamics and Action-Values in Policies

The knowledge underlying an agent's policies consists of two types: the environmental dynamics, which define state transitions around the agent, and the behavior knowledge for solving a given task. In conventional reinforcement learning, however, these two types of information are usually combined into state-value or action-value functions and learned together. If they are separated and learned independently, either might be reused in other tasks or environments. In our previous work, we presented learning rules based on policy gradients with an objective function that consists of two types of parameters, representing environmental dynamics and behavior knowledge, so that each type can be learned separately. In that learning framework, state-values were used as an example of the set of parameters corresponding to behavior knowledge. Simulation results on a pursuit problem showed that our method properly learned hunter-agent policies and could reuse either type of knowledge. In this paper, we adopt action-values instead of state-values as the set of parameters in the objective function and present learning rules for that function. Simulation results on the same pursuit problem as in our previous work show that such parameters and learning rules are also useful.

Harukazu Igarashi | Seiji Ishihara
