A memory-based reinforcement learning algorithm for partially observable Markovian decision processes

This paper presents a modified version of U-Tree (A.K. McCallum, 1996), a memory-based reinforcement learning (RL) algorithm that uses selective perception and short-term memory to handle partially observable Markovian decision processes (POMDPs). Conventional RL algorithms rely on a set of predefined states to model the environment, although they can learn the state transitions from experience. U-Tree not only learns the transitions but also builds the state model itself from raw sensor inputs. This paper enhances U-Tree's model-generation process. The paper also shows that, because of the simplified yet effective state model generated by U-Tree, it is feasible and preferable to adopt the classical dynamic programming (DP) algorithm for average-reward MDPs to solve some difficult POMDP problems. The new U-Tree is tested on a car-driving task with 31,224 world states, in which the agent has very limited sensory information and little knowledge of the environment's dynamics.
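As background for the DP component mentioned in the abstract, the sketch below shows relative value iteration, a classical DP method for average-reward MDPs. It is illustrative only: the two-state transition probabilities and rewards are hypothetical and are not taken from the paper, and the paper's own solver operates on the state model built by U-Tree rather than on a hand-specified MDP.

```python
import numpy as np

def relative_value_iteration(P, R, tol=1e-8, max_iter=10_000):
    """Relative value iteration for an average-reward MDP.

    P: array of shape (A, S, S), P[a, s, s'] = transition probability
    R: array of shape (A, S),    R[a, s]     = expected one-step reward
    Returns the estimated gain (average reward per step), the relative
    value (bias) vector h, and a greedy policy.
    """
    A, S, _ = P.shape
    h = np.zeros(S)
    ref = 0                               # arbitrary reference state
    for _ in range(max_iter):
        # One-step lookahead for every action: Q[a, s]
        Q = R + np.einsum('ast,t->as', P, h)
        h_new = Q.max(axis=0)
        gain = h_new[ref]                 # current estimate of the gain
        h_new = h_new - gain              # renormalize to keep values bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    policy = Q.argmax(axis=0)
    return gain, h, policy

# Tiny two-state example with made-up numbers, for illustration only.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # action 1
R = np.array([[1.0, 0.0],                  # reward for action 0 in s0, s1
              [0.5, 2.0]])                 # reward for action 1 in s0, s1
gain, h, policy = relative_value_iteration(P, R)
print(gain, h, policy)
```

The average-reward (gain) criterion is a natural fit for continuing tasks such as the car-driving domain used in the paper, since it optimizes long-run reward per step rather than a discounted sum.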