Inducing Partially Observable Markov Decision Processes

In reinforcement learning (Sutton and Barto, 1998; Kaelbling et al., 1996), an agent interacts with an environment to learn how to act so as to maximize reward. Two kinds of environment models dominate the literature: Markov Decision Processes, or MDPs (Puterman, 1994; Littman et al., 1995), and their partially observable counterpart, POMDPs (White, 1991; Kaelbling et al., 1998). Both consist of a Markovian state space in which state transitions and immediate rewards are influenced by the agent’s action choices. The difference is that in an MDP the agent observes the state directly, whereas in a POMDP the agent has only indirect access to the state through “observations”. This small change to the definition of the model makes an enormous difference in the difficulty of the learning and planning problems.

Computing a reward-maximizing plan for an MDP takes time polynomial in the size of the state space (Papadimitriou and Tsitsiklis, 1987), whereas determining the optimal first action to take in a POMDP is undecidable (Madani et al., 2003). The learning problem is not as well studied, but algorithms exist that approximately optimize an MDP from a polynomial amount of experience (Kearns and Singh, 2002; Strehl et al., 2009); comparable results for POMDPs remain elusive.

A key observation behind near-optimal learning in an MDP is that inducing a highly accurate model from experience can be a simple matter of counting the observed transitions between states under the selected actions: the critical quantities are all directly observed, and simple statistics suffice to reveal their relationships (see the counting sketch below). Learning in more complex MDPs is a matter of properly generalizing the observed experience to novel states (Atkeson et al., 1997) and can often be done provably efficiently (Li et al., 2011).

Inducing a POMDP, however, appears to involve a difficult “chicken-and-egg” problem. If the POMDP’s structure is known, the agent can track the likelihood of occupying each Markovian state at each moment while selecting actions and making observations, and that information is exactly what would make the structure learnable (the standard belief update is sketched below). But if the structure is not known in advance, this information is unavailable, and it is unclear how to collect the necessary statistics. In many ways, then, the POMDP induction problem shares elements with grammatical induction: the hidden states, like non-terminals, are important for explaining the structure of observed sequences, but cannot be directly detected. Researchers attempting to induce POMDP models in the context of reinforcement learning have pursued several different strategies. The first work that explicitly introduced
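For concreteness, the two models can be written as tuples; the notation below is one standard convention rather than a quotation from the cited sources:

\[
\text{MDP: } \langle S, A, T, R \rangle, \qquad T(s, a, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a), \quad R(s, a) = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a],
\]
\[
\text{POMDP: } \langle S, A, T, R, \Omega, O \rangle, \qquad O(a, s', o) = \Pr(o_{t+1} = o \mid a_t = a, s_{t+1} = s'),
\]

where \( \Omega \) is a finite observation set and the POMDP agent sees only \( o_{t+1} \) and the reward, never the underlying state \( s_{t+1} \).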
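As a minimal illustration of the counting argument, the sketch below builds maximum-likelihood estimates of a tabular MDP's transition and reward functions from logged experience. The function name, array layout, and fallback for unvisited state-action pairs are our own choices for this sketch, not the procedure of any particular algorithm cited above.

```python
import numpy as np

def estimate_mdp_model(experience, n_states, n_actions):
    """Maximum-likelihood tabular MDP model from (s, a, r, s') tuples.

    Returns T_hat[s, a, s'] (transition probabilities) and R_hat[s, a]
    (mean immediate reward), both estimated purely by counting.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in experience:
        counts[s, a, s_next] += 1.0
        reward_sum[s, a] += r

    visits = counts.sum(axis=2)  # N(s, a): how often each pair was tried
    # Unvisited (s, a) pairs fall back to a uniform transition and zero reward.
    T_hat = np.divide(counts, visits[:, :, None],
                      out=np.full_like(counts, 1.0 / n_states),
                      where=visits[:, :, None] > 0)
    R_hat = np.divide(reward_sum, visits,
                      out=np.zeros_like(reward_sum),
                      where=visits > 0)
    return T_hat, R_hat
```

Because states, actions, and rewards are all directly observed, nothing more than these counts is needed to recover an accurate model as experience accumulates.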
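The belief tracking referred to above is the standard Bayes-filter update for a POMDP whose model is already known; the sketch below is one way to write it, with array shapes matching the tuple definitions given earlier. It assumes the model (T, O) is given, which is precisely what is missing in the induction setting.

```python
import numpy as np

def belief_update(belief, action, observation, T, O):
    """One step of belief tracking in a POMDP with a known model.

    belief: shape (n_states,), current probability of each hidden state
    T:      shape (n_states, n_actions, n_states), T[s, a, s']
    O:      shape (n_actions, n_states, n_obs),    O[a, s', o]
    """
    predicted = belief @ T[:, action, :]              # sum_s b(s) * T(s, a, s')
    weighted = predicted * O[action, :, observation]  # weight by Pr(o | a, s')
    norm = weighted.sum()
    if norm == 0.0:
        raise ValueError("Observation has zero probability under this belief.")
    return weighted / norm
```

Without known T and O there is no belief to track, which is the crux of the chicken-and-egg problem described above.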

[1] Christos H. Papadimitriou and John N. Tsitsiklis. The Complexity of Markov Decision Processes. Mathematics of Operations Research, 1987.

[2] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.

[3] Andrew McCallum. Instance-Based State Identification for Reinforcement Learning. NIPS, 1994.

[4] Zongzhang Zhang, Michael L. Littman, and Xiaoping Chen. Covering Number as a Complexity Measure for POMDP Planning and Learning. AAAI, 2012.

[5] Lonnie Chrisman. Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach. AAAI, 1992.

[6] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research, 2009.

[7] Andrew McCallum. Instance-Based Utile Distinctions for Reinforcement Learning with Hidden State. ICML, 1995.

[8] Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. Locally Weighted Learning for Control. Artificial Intelligence Review, 1997.

[9] Satinder Singh, Michael L. Littman, Nicholas K. Jong, David Pardoe, and Peter Stone. Learning Predictive State Representations. ICML, 2003.

[10] Thomas J. Walsh et al. A Multiple Representation Approach to Learning Dynamical Systems. AAAI Fall Symposium on Computational Approaches to Representation Change during Learning and Development, 2007.

[11] Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the Complexity of Solving Markov Decision Problems. UAI, 1995.

[12] Lihong Li, Michael L. Littman, and Thomas J. Walsh. Knows What It Knows: A Framework for Self-Aware Learning. ICML, 2008.

[13] Andrew McCallum. Overcoming Incomplete Perception with Utile Distinction Memory. ICML, 1993.

[14] Michael L. Littman, Richard S. Sutton, and Satinder Singh. Predictive Representations of State. NIPS, 2001.

[15] Satinder Singh, Michael R. James, and Matthew R. Rudary. Predictive State Representations: A New Theory for Modeling Dynamical Systems. UAI, 2004.

[16] Omid Madani, Steve Hanks, and Anne Condon. On the Undecidability of Probabilistic Planning and Related Stochastic Optimization Problems. Artificial Intelligence, 2003.

[17] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 1989.

[18] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 1996.

[19] Michael Kearns and Satinder Singh. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 2002.

[20] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 1998.

[21] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.