Approximating Action-Value Functions: Addressing Issues of Dynamic Range

Abstract: Function approximation is necessary when applying reinforcement learning (RL) to Markov decision processes (MDPs) or semi-Markov decision processes (SMDPs) with very large state spaces. An often overlooked issue in approximating Q-functions in either framework arises when an action-value update made in one state causes a large change in the greedy policy at other states. Put another way, a small change in the Q-function can produce a large change in the implied greedy policy. We call this sensitivity to changes in the Q-function the dynamic range problem and suggest that it may greatly increase the number of training updates required to accurately approximate the optimal policy. We demonstrate that Advantage Learning solves the dynamic range problem in both frameworks and is more robust than several other RL algorithms on such problems. For an MDP, Advantage Learning addresses the issue by rescaling the dynamic range of action values within each state by a constant; for an SMDP, the scaling constant can vary for each action.
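
As an illustrative sketch only (this is the formulation commonly attributed to Baird's Advantage Learning, not a derivation from this abstract; the scaling constant k and transition time \Delta t are introduced here purely for illustration), the optimal advantage function can be written as

    A^*(s,a) \;=\; V^*(s) \;+\; \frac{1}{k\,\Delta t}\Big(\mathbb{E}\big[r + \gamma^{\Delta t}\,V^*(s')\big] - V^*(s)\Big),
    \qquad V^*(s) \;=\; \max_{a'} A^*(s,a').

The parenthesized term is zero for an optimal action and negative otherwise, so dividing by k\,\Delta t (typically less than 1) widens the gaps between action values within each state. In an MDP the transition time \Delta t is fixed, giving a single rescaling constant; in an SMDP \Delta t depends on the action taken, so the rescaling constant can differ across actions.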