Automatic landmark discovery for learning agents under partial observability

In the reinforcement learning context, a landmark is a compact piece of information that uniquely identifies a state in problems with hidden state. Landmarks have been shown to support finding good memoryless policies for Partially Observable Markov Decision Processes (POMDPs) that contain at least one landmark. SarsaLandmark, an adaptation of Sarsa(λ), is known to offer better learning performance under the assumption that all landmarks of the problem are known in advance. In this paper, we propose a framework built upon SarsaLandmark that automatically identifies landmarks during learning, without sacrificing solution quality and without requiring any prior information about the problem structure. For this purpose, the framework fuses SarsaLandmark with a well-known multiple-instance learning algorithm, Diverse Density (DD). Through further experimentation, we also provide deeper insight into our concept filtering heuristic for accelerating DD, abbreviated DDCF (Diverse Density with Concept Filtering), which proves to be well suited to POMDPs with landmarks. DDCF outperforms its antecedent in terms of both computation speed and solution quality without loss of generality. The methods are empirically shown to be effective via extensive experimentation on a number of known and newly introduced problems with hidden state, and the results are discussed.
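To make the multiple-instance learning step concrete, the following is a minimal sketch (not the paper's exact formulation) of how a Diverse-Density-style score can rank candidate landmark observations: successful episodes form positive bags of observations, unsuccessful episodes form negative bags, and an observation scores highly if it appears in every positive bag and in no negative bag. The function name, the exact-match scoring rule, and the toy trajectories below are illustrative assumptions only.

from typing import Hashable, Iterable, List, Set


def diverse_density_scores(
    positive_bags: List[Iterable[Hashable]],
    negative_bags: List[Iterable[Hashable]],
) -> dict:
    """Return a DD-style score in [0, 1] for each candidate observation."""
    pos_sets: List[Set[Hashable]] = [set(bag) for bag in positive_bags]
    neg_sets: List[Set[Hashable]] = [set(bag) for bag in negative_bags]
    candidates = set().union(*pos_sets) if pos_sets else set()

    scores = {}
    for c in candidates:
        # Fraction of positive bags containing c (the noisy-or of classic DD
        # collapses to simple membership under this exact-match assumption).
        pos_term = sum(c in bag for bag in pos_sets) / len(pos_sets)
        # Fraction of negative bags NOT containing c.
        neg_term = (
            sum(c not in bag for bag in neg_sets) / len(neg_sets)
            if neg_sets else 1.0
        )
        scores[c] = pos_term * neg_term
    return scores


# Example: the observation 'door' occurs in every successful run and in no
# failed run, so it receives the top score and is a plausible landmark.
successes = [["start", "corridor", "door", "goal"],
             ["start", "door", "goal"]]
failures = [["start", "corridor", "wall"]]
print(diverse_density_scores(successes, failures))

In the actual framework, such a scoring pass would feed candidate landmarks back into SarsaLandmark during learning; the concept filtering of DDCF additionally prunes the candidate set before scoring to reduce computation.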
