Reinforcement Learning in Non-Markov Environments

Recently, techniques based on reinforcement learning (RL) have been used to build systems that learn to perform non-trivial sequential decision tasks. To date, most of this work has focused on learning tasks that can be described as Markov decision processes (MDPs). While MDPs are useful for modeling a wide range of control problems, there are important problems that are inherently non-Markov. We refer to these as hidden state tasks, since they arise when information relevant to identifying the state of the environment is hidden (or missing) from the agent's internal representation. Two important types of control problems that resist Markov modeling are those in which 1) the system has a high degree of control over the information collected by its sensors (e.g., as in active vision), or 2) the system has a limited set of sensors that do not always provide adequate information about the current state of the environment. Not surprisingly, traditional RL algorithms, which are based primarily upon the principles of MDPs, perform unreliably on hidden state tasks. This article examines several approaches to extending RL to hidden state tasks. A generalized technique called the Consistent Representation (CR) Method is described. This method unifies such recent approaches as the Lion algorithm, the G-algorithm, and CS-QL; however, it is restricted to a class of problems we call adaptive perception tasks. Several more general, memory-based algorithms that are not subject to this restriction are also presented. Memory-based algorithms, though quite different in detail, share the common feature that each derives its internal representation by combining immediate sensory inputs with internal state that is maintained over time. The relative merits of all of these methods are considered, and conditions for their useful application are given.
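To make the memory-based idea concrete, the following sketch (not taken from the article; a minimal illustration, with all names hypothetical) shows one simple way an agent can combine immediate sensory input with internal state maintained over time: a tabular Q-learner whose state representation is the current observation augmented with a short window of recent observations and actions, so that observations that would otherwise be aliased can be distinguished.

```python
from collections import defaultdict, deque
import random

class WindowQAgent:
    """Illustrative tabular Q-learner for hidden state tasks.

    The agent's internal representation is the immediate observation plus a
    fixed-length window of past (observation, action) pairs. This is only a
    sketch of the general memory-based idea, not the article's algorithms.
    """

    def __init__(self, actions, window=2, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        # Internal state: the last `window` (observation, action) pairs.
        self.history = deque(maxlen=window)
        self.q = defaultdict(float)  # Q[(representation, action)] -> value

    def _representation(self, obs):
        # Combine the immediate sensory input with the remembered history.
        return (obs, tuple(self.history))

    def act(self, obs):
        rep = self._representation(obs)
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(rep, a)])

    def update(self, obs, action, reward, next_obs):
        rep = self._representation(obs)
        # Push the current step into memory before forming the next state.
        self.history.append((obs, action))
        next_rep = self._representation(next_obs)
        best_next = max(self.q[(next_rep, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(rep, action)] += self.alpha * (target - self.q[(rep, action)])
```

The window length trades off how much hidden state the agent can resolve against how quickly the state space grows; other memory-based schemes (e.g., recurrent networks) maintain the internal state differently but play the same representational role.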
