Reduced Training Time for Reinforcement Learning with Hidden State

Real robots with real sensors are not omniscient. When a robot's next course of action depends on information that is hidden from the sensors because of problems such as occlusion, restricted range, bounded field of view and limited attention, we say the robot suffers from the hidden state problem. State identification techniques use history information to uncover hidden state. Previous approaches to encoding history include finite state machines [Chrisman, 1992; McCallum, 1992a], recurrent neural networks [Lin and Mitchell, 1992] and genetic programming with indexed memory [Teller, 1994]. A chief disadvantage of all these techniques is their long training time. This paper presents instance-based state identification, a new approach to hidden state reinforcement learning that learns with far fewer training steps. Noting that gathering world experience is often much more expensive than computation and storage, the approach applies instance-based (or "memory-based") learning to history sequences. The first implementation of this approach, called Nearest Sequence Memory, learns with an order of magnitude fewer steps than several previous approaches (an illustrative sketch of the matching idea appears after Section 1.1 below).

1 Introduction

A reinforcement learning agent suffers from the hidden state problem if at any time the agent's state representation is missing information needed to determine the next correct action, that is, if the agent's state representation is non-Markovian with respect to actions and utility.

The hidden state problem arises as a case of perceptual aliasing: the mapping between states of the world and sensations of the agent is not one-to-one [Whitehead and Ballard, 1991]. If the agent's perceptual system produces the same outputs for two world states in which different actions are required, and if the agent's state representation consists only of its percepts, then the agent will fail to choose correct actions.

There are many reasons that important features could be hidden from a robot's perception: sensors have noise, limited range and limited field of view; occlusions hide areas from sensing; limited funds and space prevent equipping the robot with all desired sensors; an exhaustible power supply deters the robot from using all sensors all the time; and the robot has limited computational resources for turning raw sensor data into usable percepts.

1.1 Stateless Agents

One solution to the hidden state problem is simply to avoid passing through the perceptually aliased states. This is the approach taken in Whitehead's Lion algorithm [Whitehead, 1992]. Whenever the agent finds a state that delivers inconsistent reward, it sets that state's utility so low that the policy will …
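
To make the instance-based idea concrete, the following is a minimal Python sketch of nearest-sequence matching over a stored experience history, in the spirit of Nearest Sequence Memory. The class name, the fixed k, the exact match-length metric, and the simple averaging of stored Q-values are illustrative assumptions rather than the paper's formulation; Q-value updates and action selection are omitted.

class NearestSequenceMemory:
    """Illustrative sketch (simplified assumption, not the paper's exact
    algorithm): identify the current hidden state by matching the recent
    history against stored experience sequences."""

    def __init__(self, actions, k=4):
        self.actions = list(actions)
        self.k = k                 # number of nearest sequences that vote
        self.history = []          # one (action, observation, reward) per step
        self.q = []                # per-step Q estimates: q[t][action]

    def _match_length(self, i, j):
        # Count how many immediately preceding steps agree exactly.
        n = 0
        while i - n >= 0 and j - n >= 0 and self.history[i - n] == self.history[j - n]:
            n += 1
        return n

    def record(self, action, observation, reward):
        # Store the raw experience instead of compressing it into a state.
        self.history.append((action, observation, reward))
        self.q.append({a: 0.0 for a in self.actions})

    def q_values(self):
        # Find the k past time steps whose surrounding histories best match
        # the present one, and average their stored Q-values per action.
        t = len(self.history) - 1
        if t <= 0:
            return {a: 0.0 for a in self.actions}
        neighbors = sorted(range(t), key=lambda j: self._match_length(t, j),
                           reverse=True)[: self.k]
        return {a: sum(self.q[j][a] for j in neighbors) / len(neighbors)
                for a in self.actions}

A hypothetical usage: after mem = NearestSequenceMemory(actions=["forward", "turn-left", "turn-right"]) and a few mem.record(action, observation, reward) calls, mem.q_values() returns action values voted by the nearest stored sequences. Because experience is kept verbatim rather than folded into a learned state machine or network weights, each new step of world experience is usable immediately, which is the source of the reduced training time claimed above.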

[1] Lonnie Chrisman. Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach. AAAI, 1992.

[2] Tom M. Mitchell, et al. Reinforcement learning with hidden states. 1993.

[3] R. A. McCallum. First Results with Utile Distinction Memory for Reinforcement Learning. 1992.

[4] Astro Teller. The evolution of mental models. 1994.

[5] R. A. McCallum. Learning with Incomplete Selective Perception. 1993.

[6] Andreas Stolcke, et al. Hidden Markov Model Induction by Bayesian Model Merging. NIPS, 1992.

[7] Andrew McCallum, et al. Using Transitional Proximity for Faster Reinforcement Learning. ML, 1992.

[8] Long-Ji Lin. Reinforcement learning for robots using neural networks. 1992.

[9] Steven Douglas Whitehead. Reinforcement learning for the adaptive control of perception and action. 1992.

[10] Richard S. Sutton. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. ML, 1990.

[11] Jeff G. Schneider. High Dimension Action Spaces in Robot Skill Learning. AAAI, 1994.

[12] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Complex Adaptive Systems, 1993.

[13] Maja J. Matarić. A Distributed Model for Mobile Robot Environment-Learning and Navigation. 1990.

[14] Leslie Pack Kaelbling, et al. Acting Optimally in Partially Observable Stochastic Domains. AAAI, 1994.

[15] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion). 1977.

[16] Andrew McCallum. Overcoming Incomplete Perception with Utile Distinction Memory. ICML, 1993.

[17] Michael L. Littman. Memoryless policies: theoretical limitations and practical results. 1994.

[18] R. A. McCallum. First Results with Instance-Based State Identification for Reinforcement Learning. 1994.

[19] Peter Dorato, et al. Dynamic programming and stochastic control. 1978.

[20] Long-Ji Lin. Self-improvement Based on Reinforcement Learning, Planning and Teaching. ML, 1991.