Apprenticeship Learning for Initial Value Functions in Reinforcement Learning

Reinforcement learning has had spectacular successes over the last several decades. Although it is meant to require less human input than supervised learning, reinforcement learning can be substantially accelerated by a priori domain expertise. Ways of providing human knowledge to a reinforcement learning agent range from crafting state features to designing an initial policy to designing an initial value function. We pursue the last of these and propose a novel approach for acquiring a high-quality initial value function via apprenticeship learning. The approach works well in domains where a body of expert data is available. Our apprentice reinforcement learning (ARL) agent uses dynamic programming to compute values for the states visited by the expert. A Laplacian regularizer then extrapolates these values onto the entire state space. The result is a high-quality initial value function that can be further refined by any value-function-based reinforcement learning method. In a grid world domain, ARL sped up TD(λ) learning by a factor of two from a single observed expert trace. A minimal illustrative sketch of the two-stage initialization follows the abstract.
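The sketch below (Python/NumPy) illustrates the two stages described above under assumed details that are not specified here: a deterministic 4-connected grid world, a step reward of -1, a goal reward of 0, and discount 1. The grid size, reward values, and the hand-made expert trace are illustrative placeholders, not the paper's experimental setup.

import numpy as np

GAMMA = 1.0          # discount factor (assumed)
STEP_REWARD = -1.0   # per-move reward (assumed)
W, H = 10, 10        # grid dimensions (assumed)

def state_id(x, y):
    return y * W + x

def neighbors(x, y):
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < W and 0 <= ny < H:
            yield nx, ny

# Stage 1: dynamic programming along the observed expert trace.
# Values of visited states are backed up from the goal (last state).
def values_on_trace(expert_trace):
    values = {}
    v = 0.0                                 # value of the goal state
    for x, y in reversed(expert_trace):
        values[state_id(x, y)] = v
        v = STEP_REWARD + GAMMA * v
    return values

# Stage 2: Laplacian regularization over the whole state space.
# Unknown states get values minimizing the sum of squared differences
# across grid edges with expert-state values held fixed, i.e. solve
# L_uu v_u = -L_uk v_k for the unknown block of the graph Laplacian.
def laplacian_extrapolate(known):
    n = W * H
    L = np.zeros((n, n))
    for x in range(W):
        for y in range(H):
            i = state_id(x, y)
            for nx, ny in neighbors(x, y):
                j = state_id(nx, ny)
                L[i, i] += 1.0
                L[i, j] -= 1.0
    known_ids = sorted(known)
    unknown_ids = [i for i in range(n) if i not in known]
    v_k = np.array([known[i] for i in known_ids])
    L_uu = L[np.ix_(unknown_ids, unknown_ids)]
    L_uk = L[np.ix_(unknown_ids, known_ids)]
    v_u = np.linalg.solve(L_uu, -L_uk @ v_k)
    v = np.zeros(n)
    v[known_ids] = v_k
    v[unknown_ids] = v_u
    return v.reshape(H, W)

if __name__ == "__main__":
    # A hypothetical expert trace from (0, 0) to the goal at (9, 9).
    trace = [(i, 0) for i in range(W)] + [(W - 1, j) for j in range(1, H)]
    v0 = laplacian_extrapolate(values_on_trace(trace))
    print(np.round(v0, 1))  # initial value function to seed TD(lambda)

The resulting array would serve only as the starting point; any value-function-based method such as TD(lambda) would then refine it through ordinary interaction with the environment.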
