Action Refinement in Reinforcement Learning by Probability Smoothing

In many reinforcement learning applications, the set of possible actions can be partitioned by the programmer into subsets of similar actions. This paper presents a technique for exploiting this form of prior information to speed up model-based reinforcement learning. We call it an action refinement method because it treats each subset of similar actions as a single “abstract” action early in the learning process and later “refines” the abstract action into individual actions as more experience is gathered. Our method estimates the transition probabilities P(s′|s, a) for an action a by combining the results of executions of action a with executions of other actions in the same subset of similar actions. This is a form of “smoothing” of the probability estimates that trades increased bias for reduced variance. The paper derives a formula for optimal smoothing, which shows that the degree of smoothing should decrease as the amount of data increases. Experiments show that probability smoothing is better than two simpler action refinement methods on a synthetic maze problem. Action refinement is most useful in problems, such as robotics, where training experiences are expensive.
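
The core idea can be sketched as shrinking the per-action maximum-likelihood estimate of P(s′|s, a) toward the pooled estimate over the action's similarity subset, with a mixing weight that decays as data for that action accumulates. The sketch below is illustrative only, assuming a simple fixed-strength weight lam = k / (k + n_a); the names counts_a, counts_cluster, and k are hypothetical, and this heuristic weight stands in for, but is not, the optimal smoothing formula derived in the paper.

```python
import numpy as np

def smoothed_transition_estimate(counts_a, counts_cluster, k=5.0):
    """Illustrative smoothed estimate of P(s'|s, a).

    counts_a       : transition counts for action a taken from state s
    counts_cluster : pooled counts for all actions in a's similarity subset
                     (including a itself) taken from state s
    k              : smoothing strength (hypothetical stand-in; the paper
                     derives a data-dependent optimal weight instead)
    """
    counts_a = np.asarray(counts_a, dtype=float)
    counts_cluster = np.asarray(counts_cluster, dtype=float)
    n_a = counts_a.sum()
    n_c = counts_cluster.sum()

    # Empirical estimates from the individual action and from the pooled
    # "abstract" action; fall back to uniform when there is no data.
    p_a = counts_a / n_a if n_a > 0 else np.full(len(counts_a), 1.0 / len(counts_a))
    p_c = counts_cluster / n_c if n_c > 0 else np.full(len(counts_cluster), 1.0 / len(counts_cluster))

    # The smoothing weight shrinks toward 0 as data for action a accumulates,
    # so the estimate moves from the abstract (pooled) action toward the
    # individual action -- the qualitative behavior described in the abstract.
    lam = k / (k + n_a)
    return lam * p_c + (1.0 - lam) * p_a
```

Early in learning (n_a small) the estimate is dominated by the pooled cluster statistics, which is what makes the subset behave like a single abstract action; as n_a grows the individual action's own counts take over, refining the abstract action into its members.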