The Influence of Reward on the Speed of Reinforcement Learning: An Analysis of Shaping

Shaping can be an effective method for improving the learning rate of reinforcement learning systems. Previously, shaping has been heuristically motivated and implemented. We provide a formal structure with which to interpret the improvement afforded by shaping rewards. Central to our model is the idea of a reward horizon, which focuses exploration on an MDP's critical region: a subset of states with the property that any policy that performs well on the critical region also performs well on the MDP as a whole. We provide a simple algorithm and prove that its learning time is polynomial in the size of the critical region and, crucially, independent of the size of the MDP. This result identifies MDPs with low reward horizons as easy to learn. Shaping rewards, which encode our prior knowledge about the relative merits of decisions, can be seen as artificially reducing the MDP's natural reward horizon. We demonstrate empirically the effects of using shaping to reduce the reward horizon.
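The abstract does not fix a particular implementation, but the effect it describes, shaping rewards steering exploration so that informative feedback arrives after fewer decisions, can be illustrated with a small, self-contained sketch. The snippet below is not the algorithm analyzed in the paper; it is ordinary tabular Q-learning on a hypothetical sparse-reward chain MDP, with an optional potential-based shaping bonus. The chain length, learning parameters, and potential function are all illustrative assumptions.

```python
# Illustrative sketch only (not the paper's algorithm): tabular Q-learning on
# a small chain MDP, with an optional potential-based shaping bonus that
# rewards progress toward the goal state.
import random

N_STATES = 20          # chain states 0 .. N_STATES-1; the goal is the last state
ACTIONS = (-1, +1)     # move left or right along the chain
GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1

def potential(s):
    # Assumed potential function: states closer to the goal get higher potential.
    return s / (N_STATES - 1)

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0   # sparse reward: goal state only
    return s2, r, s2 == N_STATES - 1

def q_learning(episodes=200, shaped=False):
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    steps_per_episode = []
    for _ in range(episodes):
        s, done, t = 0, False, 0
        while not done and t < 1000:
            # Epsilon-greedy action selection.
            if random.random() < EPSILON:
                a_idx = random.randrange(2)
            else:
                a_idx = max(range(2), key=lambda i: Q[s][i])
            s2, r, done = step(s, ACTIONS[a_idx])
            if shaped:
                # Shaping bonus F(s, s') = gamma * phi(s') - phi(s); the extra
                # signal rewards progress long before the sparse environment
                # reward is ever observed.
                r += GAMMA * potential(s2) - potential(s)
            target = r + (0.0 if done else GAMMA * max(Q[s2]))
            Q[s][a_idx] += ALPHA * (target - Q[s][a_idx])
            s, t = s2, t + 1
        steps_per_episode.append(t)
    return steps_per_episode

if __name__ == "__main__":
    random.seed(0)
    plain = q_learning(shaped=False)
    shaped = q_learning(shaped=True)
    print("mean steps/episode, unshaped:", sum(plain) / len(plain))
    print("mean steps/episode, shaped:  ", sum(shaped) / len(shaped))
```

In this toy setting the shaped learner typically reaches the goal in far fewer steps per episode, which loosely mirrors the abstract's claim that shaping can be viewed as shortening the distance between a decision and the reward information that evaluates it.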