Reinforcement Learning in Non-Markov Environments

Recently, techniques based on reinforcement learning (RL) have been used to build systems that learn to perform non-trivial sequential decision tasks. To date, most of this work has focused on learning tasks that can be described as Markov decision processes (MDPs). While MDPs are useful for modeling a wide range of control problems, there are important problems that are inherently non-Markov. We refer to these as hidden state tasks, since they arise when information relevant to identifying the state of the environment is hidden (or missing) from the agent's internal representation. Two important types of control problems that resist Markov modeling are those in which 1) the system has a high degree of control over the information collected by its sensors (e.g., as in active vision), or 2) the system has a limited set of sensors that do not always provide adequate information about the current state of the environment. Not surprisingly, traditional RL algorithms, which are based primarily upon the principles of MDPs, perform unreliably on hidden state tasks. This article examines several approaches to extending RL to hidden state tasks. A generalized technique called the Consistent Representation (CR) Method is described. This method unifies such recent approaches as the Lion algorithm, the G-algorithm, and CS-QL; however, it is restricted to a class of problems we call adaptive perception tasks. Several more general, memory-based algorithms that are not subject to this restriction are also presented. Memory-based algorithms, though quite different in detail, share the common feature that each derives its internal representation by combining immediate sensory inputs with internal state that is maintained over time. The relative merits of all of these methods are considered, and conditions for their useful application are given.
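To make the memory-based idea concrete, the following sketch (not taken from the article; a minimal illustration, with all names hypothetical) shows one simple way an agent can combine immediate sensory input with internal state maintained over time: a tabular Q-learner whose state representation is the current observation augmented with a short window of recent observations and actions, so that observations that would otherwise be aliased can be distinguished.

```python
from collections import defaultdict, deque
import random

class WindowQAgent:
    """Illustrative tabular Q-learner for hidden state tasks.

    The agent's internal representation is the immediate observation plus a
    fixed-length window of past (observation, action) pairs. This is only a
    sketch of the general memory-based idea, not the article's algorithms.
    """

    def __init__(self, actions, window=2, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        # Internal state: the last `window` (observation, action) pairs.
        self.history = deque(maxlen=window)
        self.q = defaultdict(float)  # Q[(representation, action)] -> value

    def _representation(self, obs):
        # Combine the immediate sensory input with the remembered history.
        return (obs, tuple(self.history))

    def act(self, obs):
        rep = self._representation(obs)
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(rep, a)])

    def update(self, obs, action, reward, next_obs):
        rep = self._representation(obs)
        # Push the current step into memory before forming the next state.
        self.history.append((obs, action))
        next_rep = self._representation(next_obs)
        best_next = max(self.q[(next_rep, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(rep, action)] += self.alpha * (target - self.q[(rep, action)])
```

The window length trades off how much hidden state the agent can resolve against how quickly the state space grows; other memory-based schemes (e.g., recurrent networks) maintain the internal state differently but play the same representational role.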
