SarsaLandmark: an algorithm for learning in POMDPs with landmarks

Reinforcement learning algorithms that use eligibility traces, such as Sarsa(λ), have been empirically shown to be effective in learning good estimated-state-based policies in partially observable Markov decision processes (POMDPs). Nevertheless, one can construct counterexamples: problems in which Sarsa(λ < 1) fails to find a good policy even though one exists. Despite this, these algorithms remain of great interest because alternative approaches to learning in POMDPs based on approximating belief states do not scale. In this paper we present SarsaLandmark, an algorithm for learning in POMDPs with "landmark" states (most man-made and many natural environments have landmarks). SarsaLandmark simultaneously preserves the advantages offered by eligibility traces and fixes the cause of the failure of Sarsa(λ) on the motivating counterexamples. We present a theoretical analysis of SarsaLandmark for the policy evaluation problem and present empirical results on a few learning control problems.
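To make the setting concrete, the following is a minimal, hypothetical sketch of a landmark-aware variant of tabular Sarsa(λ) over observations. It assumes that landmark observations uniquely identify the underlying state and that eligibility traces are cleared upon reaching a landmark; these modeling choices are illustrative assumptions for this sketch, not necessarily the exact update rule used by SarsaLandmark in the paper.

```python
import random
from collections import defaultdict

class LandmarkSarsaAgent:
    """Illustrative sketch: tabular Sarsa(lambda) over observations,
    with traces cleared at assumed 'landmark' observations."""

    def __init__(self, actions, landmarks, alpha=0.1, gamma=0.95,
                 lam=0.9, epsilon=0.1):
        self.actions = list(actions)
        self.landmarks = set(landmarks)   # observations assumed to identify the state
        self.alpha, self.gamma, self.lam, self.epsilon = alpha, gamma, lam, epsilon
        self.q = defaultdict(float)       # (observation, action) -> value estimate
        self.e = defaultdict(float)       # eligibility traces

    def act(self, obs):
        # epsilon-greedy over the observation-based Q estimates
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(obs, a)])

    def update(self, obs, action, reward, next_obs, next_action, done):
        # standard Sarsa(lambda) update with replacing traces
        target = reward if done else reward + self.gamma * self.q[(next_obs, next_action)]
        delta = target - self.q[(obs, action)]
        self.e[(obs, action)] = 1.0
        for key in list(self.e):
            self.q[key] += self.alpha * delta * self.e[key]
            self.e[key] *= self.gamma * self.lam
        # assumed landmark rule: once a landmark is reached, the underlying
        # state is known, so traces from the preceding aliased segment are dropped
        if done or next_obs in self.landmarks:
            self.e.clear()
```

The intended intuition of the sketch is that between landmarks the agent effectively relies on long (near-Monte-Carlo) backups over possibly aliased observations, while landmarks serve as anchoring points where state uncertainty is resolved.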
