Generating Memoryless Policies Faster Using Automatic Temporal Abstractions for Reinforcement Learning with Hidden State

Reinforcement learning with eligibility traces has proven effective for problems with hidden state; under certain conditions it can build an optimal memoryless policy over observations. Automatic generation of temporal abstractions, on the other hand, provides ways to extract and reuse useful sub-policies during reinforcement learning in fully observable settings, so that the agent does not need to relearn the same skill repeatedly. One recent automatic abstraction technique is the extended sequence tree method. We propose a novel way to bring together the extended sequence tree method and reinforcement learning for problems with hidden state. We augment the extended sequence tree method with a mechanism that shields the abstraction procedure from the adverse effects of perceptual aliasing, letting the agent make use of the remaining useful abstractions. The effectiveness of the method is demonstrated empirically on several benchmark problems.
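As an illustration of the memoryless-policy setting the paper builds on, the following is a minimal sketch of Sarsa(λ) with replacing eligibility traces applied directly to observations rather than underlying states. The environment interface (reset/step), the epsilon_greedy helper, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, obs, actions, epsilon):
    # Explore with probability epsilon, otherwise pick the greedy action
    # for the current observation (not the hidden underlying state).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(obs, a)])

def sarsa_lambda(env, actions, episodes=500, alpha=0.1, gamma=0.95,
                 lam=0.9, epsilon=0.1):
    # Q is indexed by (observation, action) pairs, so the learned policy
    # is memoryless: it conditions only on the current observation.
    Q = defaultdict(float)
    for _ in range(episodes):
        traces = defaultdict(float)          # replacing eligibility traces
        obs = env.reset()                    # assumed env interface
        act = epsilon_greedy(Q, obs, actions, epsilon)
        done = False
        while not done:
            next_obs, reward, done = env.step(act)   # assumed env interface
            next_act = epsilon_greedy(Q, next_obs, actions, epsilon)
            target = reward + (0.0 if done else gamma * Q[(next_obs, next_act)])
            delta = target - Q[(obs, act)]
            traces[(obs, act)] = 1.0         # replacing (not accumulating) trace
            for key in list(traces):
                Q[key] += alpha * delta * traces[key]
                traces[key] *= gamma * lam   # decay all traces
            obs, act = next_obs, next_act
    return Q
```

The sketch shows only the baseline that learns memoryless policies over observations; the paper's contribution, filtering out abstractions corrupted by perceptual aliasing in the extended sequence tree, would sit on top of a learner like this.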
