Reward Mapping for Transfer in Long-Lived Agents

We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent's behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent's performance relative to other approaches, including an approach that transfers policies.
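To make the idea of a reward mapping concrete, below is a minimal sketch (not the paper's exact method) of one plausible instantiation: a linear map, trained incrementally across tasks, from a task-specifying reward vector to an initial guidance reward vector for the next task. All names, the linear form, and the update rule are illustrative assumptions.

```python
import numpy as np


class RewardMapping:
    """Hypothetical reward mapping function.

    Predicts an initial guidance reward vector from a task-specifying
    reward vector, and is trained incrementally from pairs
    (task reward, good guidance reward) collected on earlier tasks.
    """

    def __init__(self, dim, lr=0.1):
        self.W = np.zeros((dim, dim))  # linear map: r_guide ~ W @ r_task
        self.lr = lr

    def predict(self, r_task):
        # Initial guidance reward used to start learning on a new task.
        return self.W @ r_task

    def update(self, r_task, r_guide_good):
        # One stochastic gradient step on squared prediction error,
        # using the good guidance reward found on the finished task.
        err = self.predict(r_task) - r_guide_good
        self.W -= self.lr * np.outer(err, r_task)


if __name__ == "__main__":
    # Usage sketch over a task sequence (details are placeholders).
    rng = np.random.default_rng(0)
    mapping = RewardMapping(dim=4)
    for task in range(100):
        r_task = rng.normal(size=4)        # task-specifying reward features
        r_init = mapping.predict(r_task)   # transferred initial guidance reward
        # ... run the bounded agent with r_init and refine it online
        # (e.g., by gradient ascent on task return) to get r_good ...
        r_good = 2.0 * r_task + 0.1        # placeholder for the learned reward
        mapping.update(r_task, r_good)
```

The point of the sketch is only the data flow: each finished task contributes a training pair to the mapping, and each new task starts from the mapping's prediction rather than from scratch.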
