Toward Generalization of Automated Temporal Abstraction to Partially Observable Reinforcement Learning

Temporal abstraction for reinforcement learning (RL) aims to decrease learning time by exploiting repeated sub-policy patterns in the learning task. Automatically extracting abstractions during the RL process is difficult and poses several challenges, such as dealing with the curse of dimensionality. Various studies have explored the subject under the assumption that the problem domain is fully observable by the learning agent. Learning abstractions for partially observable RL is a comparatively less explored area. In this paper, we adapt an existing automatic abstraction method, the extended sequence tree, originally designed for fully observable problems. The modified method covers a certain family of model-based partially observable RL settings. We also introduce belief state discretization methods that can be used with this new abstraction mechanism. The effectiveness of the proposed abstraction method is demonstrated empirically on well-known benchmark problems.
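To make the notion of belief state discretization concrete, the following is a minimal sketch, not the paper's own method, of one common approach: snapping a POMDP belief vector onto a regular resolution-k grid over the probability simplex, in the spirit of grid-based POMDP approximations. The function name discretize_belief and the resolution parameter k are illustrative assumptions.

import numpy as np

def discretize_belief(belief, k=10):
    """Map a belief vector (probabilities over hidden states) to the nearest
    point of a resolution-k grid on the probability simplex.

    Each entry is scaled by k and rounded down; the leftover mass is assigned
    to the entries with the largest fractional parts so the result still sums
    to k (i.e., to 1 after dividing by k).
    """
    belief = np.asarray(belief, dtype=float)
    scaled = belief * k
    floors = np.floor(scaled).astype(int)
    remainder = k - floors.sum()          # grid units still to distribute
    fractional = scaled - floors
    order = np.argsort(-fractional)       # largest fractional parts first
    floors[order[:remainder]] += 1
    return tuple(floors)                  # hashable key usable in tabular RL

# Usage: two nearby beliefs fall into the same discrete cell, so value
# estimates and abstraction statistics can be shared between them.
b1 = [0.62, 0.30, 0.08]
b2 = [0.60, 0.31, 0.09]
print(discretize_belief(b1), discretize_belief(b2))   # both map to (6, 3, 1)

Coarser grids (smaller k) merge more beliefs and speed up learning at the cost of resolution, which is the usual trade-off when pairing such a discretization with a tabular abstraction mechanism.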
