Landmark Based Reward Shaping in Reinforcement Learning with Hidden States

While most of the work on reward shaping focuses on fully observable problems, very few studies couple reward shaping with partial observability. Moreover, for problems with hidden states, where there is no prior information about the underlying states, reward shaping opportunities remain unexplored. In this paper, we show that landmarks can be used to shape the rewards in reinforcement learning with hidden states. The proposed approach is empirically shown to improve learning performance in terms of both speed and quality.
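To make the idea concrete, the following is a minimal, hypothetical sketch (not the paper's algorithm) of potential-based reward shaping driven by landmarks in a setting with hidden states: the agent learns over raw observations, and the shaping potential increases with the index of the most recently observed landmark in an assumed ordered landmark sequence. The environment interface (`reset`, `step`, `actions`), the `LANDMARKS` list, and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch: tabular Q-learning over observations with
# landmark-based potential shaping, F = gamma * Phi(s') - Phi(s).
import random
from collections import defaultdict

LANDMARKS = ["L0", "L1", "L2", "L3"]      # assumed ordered landmark sequence
GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.1    # illustrative hyperparameters

def potential(landmark_idx):
    """Higher potential the further along the landmark sequence the agent is."""
    return float(landmark_idx)

def epsilon_greedy(Q, obs, actions):
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(obs, a)])

def train(env, episodes=500):
    """`env` is assumed to expose reset() -> obs and
    step(action) -> (obs, reward, done, landmark_or_None)."""
    Q = defaultdict(float)
    actions = env.actions
    for _ in range(episodes):
        obs, idx, done = env.reset(), 0, False
        while not done:
            a = epsilon_greedy(Q, obs, actions)
            next_obs, r, done, landmark = env.step(a)
            # Advance the landmark index only when a known landmark is seen.
            next_idx = LANDMARKS.index(landmark) if landmark in LANDMARKS else idx
            # Potential-based shaping term added to the environment reward.
            target = r + GAMMA * potential(next_idx) - potential(idx)
            if not done:
                target += GAMMA * max(Q[(next_obs, b)] for b in actions)
            Q[(obs, a)] += ALPHA * (target - Q[(obs, a)])
            obs, idx = next_obs, next_idx
    return Q
```

Because the shaping term is potential-based, it leaves the set of optimal policies unchanged in the fully observable case (Ng et al., 1999); under hidden states this guarantee does not directly carry over, which is part of what the paper investigates empirically.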
