Reinforcement Learning through Global Stochastic Search in N-MDPs

Reinforcement Learning (RL) in fully or partially observable domains usually imposes a requirement on the knowledge representation in order to be sound: the underlying stochastic process must be Markovian. In many applications, including those involving interactions between multiple agents (e.g., humans and robots), sources of uncertainty affect rewards and transition dynamics in such a way that a Markovian representation would be computationally very expensive. An alternative formulation of the decision problem involves partially specified behaviors with choice points. While this reduces the complexity of the policy space that must be explored (crucial for realistic autonomous agents, which must bound search time), it renders the domain non-Markovian. In this paper we present a novel algorithm for reinforcement learning in non-Markovian domains. Our algorithm, Stochastic Search Monte Carlo, performs a global stochastic search in policy space, shaping the distribution from which the next policy is sampled by estimating an upper bound on the value of each action. We show experimentally that, in domains that are challenging for RL, high-level decisions in non-Markovian processes can yield behavior at least as good as that learned by traditional algorithms, while requiring significantly fewer samples.
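To make the core idea concrete, below is a minimal Python sketch of a global Monte Carlo search over a tiny policy space with discrete choice points, where the distribution from which the next policy is sampled is shaped by optimistic (upper-bound) estimates of each action's value. The toy environment, the UCB-style bonus, and all identifiers are illustrative assumptions for this sketch, not the paper's benchmark domains or its exact bound.

```python
import math
import random
from collections import defaultdict

# Illustrative toy task with two choice points and a stochastic return.
# This is an assumed placeholder environment, only meant to exercise the search loop.
CHOICE_POINTS = ["cp0", "cp1"]
ACTIONS = {"cp0": ["left", "right"], "cp1": ["fast", "slow"]}

def rollout(policy):
    """Return a noisy episode return for a deterministic policy (dict: choice point -> action)."""
    base = 0.0
    if policy["cp0"] == "right":
        base += 1.0
    if policy["cp1"] == "fast":
        base += 0.5
    return base + random.gauss(0.0, 0.3)  # stochastic, possibly non-Markovian reward

def search(num_iterations=2000, exploration=1.0):
    counts = defaultdict(int)    # visits per (choice point, action)
    means = defaultdict(float)   # running mean return per (choice point, action)

    def upper_bound(cp, a, t):
        # Optimistic value estimate (UCB-style bonus; an assumption, not the paper's bound).
        n = counts[(cp, a)]
        if n == 0:
            return float("inf")
        return means[(cp, a)] + exploration * math.sqrt(math.log(t + 1) / n)

    for t in range(1, num_iterations + 1):
        # Global stochastic search step: sample the next policy by choosing, at each
        # choice point, an action with probability proportional to its upper bound.
        policy = {}
        for cp in CHOICE_POINTS:
            bounds = [upper_bound(cp, a, t) for a in ACTIONS[cp]]
            if any(math.isinf(b) for b in bounds):
                # Try untested actions at least once before weighting by the bounds.
                policy[cp] = ACTIONS[cp][bounds.index(float("inf"))]
            else:
                lo = min(bounds)
                weights = [b - lo + 1e-6 for b in bounds]
                policy[cp] = random.choices(ACTIONS[cp], weights=weights)[0]

        # Monte Carlo evaluation of the sampled policy, then update the statistics
        # of every action that policy used.
        g = rollout(policy)
        for cp, a in policy.items():
            counts[(cp, a)] += 1
            means[(cp, a)] += (g - means[(cp, a)]) / counts[(cp, a)]

    # Return the greedy policy with respect to the estimated mean returns.
    return {cp: max(ACTIONS[cp], key=lambda a: means[(cp, a)]) for cp in CHOICE_POINTS}

if __name__ == "__main__":
    print(search())
```

In this sketch the policy is sampled as a whole at each iteration, rather than improved state by state, which is what makes the search global over the space of deterministic memoryless policies defined by the choice points.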
