Paper information - Plan-Based Relaxed Reward Shaping for Goal-Directed Tasks

In high-dimensional state spaces, the usefulness of Reinforcement Learning (RL) is limited by the problem of exploration. Potential-based reward shaping (PB-RS) has previously been used to address this issue. In the present work, we introduce Final-Volume-Preserving Reward Shaping (FV-RS). FV-RS relaxes the strict optimality guarantees of PB-RS to a guarantee of preserved long-term behavior. Being less restrictive, FV-RS allows for reward shaping functions that are even better suited for improving the sample efficiency of RL algorithms. In particular, we consider settings in which the agent has access to an approximate plan. Here, we use examples of simulated robotic manipulation tasks to demonstrate that plan-based FV-RS can indeed significantly improve the sample efficiency of RL over plan-based PB-RS.
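For readers unfamiliar with PB-RS, the following is a minimal sketch of the standard potential-based shaping form from Ng et al. (1999), in which the shaped reward is r + gamma * Phi(s') - Phi(s). It is not the FV-RS method of this paper, whose definition is not given in the abstract. The distance-to-goal potential and all function names are assumptions chosen for illustration only.

```python
import math
from typing import Callable, Sequence

State = Sequence[float]


def distance_potential(goal: State) -> Callable[[State], float]:
    """Hypothetical potential: negative Euclidean distance to a goal state."""
    def phi(s: State) -> float:
        return -math.dist(s, goal)
    return phi


def pb_shaped_reward(r: float, s: State, s_next: State,
                     phi: Callable[[State], float], gamma: float) -> float:
    """Original reward plus the potential-based shaping term."""
    return r + gamma * phi(s_next) - phi(s)


def discounted_shaping_sum(states: Sequence[State],
                           phi: Callable[[State], float],
                           gamma: float) -> float:
    """Discounted sum of the shaping terms along a trajectory.

    The sum telescopes to gamma**T * phi(s_T) - phi(s_0), which is why
    potential-based shaping leaves the optimal policy unchanged.
    """
    return sum(
        gamma ** t * (gamma * phi(states[t + 1]) - phi(states[t]))
        for t in range(len(states) - 1)
    )


if __name__ == "__main__":
    phi = distance_potential(goal=(3.0, 0.0))
    trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    print(pb_shaped_reward(0.0, trajectory[0], trajectory[1], phi, gamma=0.9))
    print(discounted_shaping_sum(trajectory, phi, gamma=0.9))
```

Because the shaping terms telescope, their total contribution depends only on the first and last states of the trajectory. This is the strict guarantee that, according to the abstract, FV-RS relaxes to a guarantee of preserved long-term behavior.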

Marc Toussaint | Ozgur S. Oguz | Ingmar Schubert
