A PAC RL Algorithm for Episodic POMDPs

Many interesting real-world domains involve reinforcement learning (RL) in partially observable environments. Efficient learning in such domains is important, but existing sample-complexity bounds for partially observable RL are at least exponential in the episode length. We give, to our knowledge, the first partially observable RL algorithm with a polynomial bound on the number of episodes on which the algorithm may fail to achieve near-optimal performance. Our algorithm is suitable for an important class of episodic POMDPs. Our approach builds on recent advances in the method of moments for latent variable model estimation.
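For intuition, below is a minimal sketch (not the paper's algorithm) of the kind of low-order moment statistics that method-of-moments estimators for latent variable models are built on: the co-occurrence matrix of consecutive observations of a hidden Markov model, which factors through the hidden-state parameters as P2 = O T diag(pi) O^T. The toy HMM, its sizes, and the sample count are all illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of a second-order moment used by method-of-moments
# estimators for latent variable models; the toy HMM below is an assumption.
rng = np.random.default_rng(0)

# Toy HMM: 2 hidden states, 3 discrete observations.
pi = np.array([0.6, 0.4])        # initial hidden-state distribution
T = np.array([[0.7, 0.2],        # T[h2, h1] = Pr[next state h2 | state h1]
              [0.3, 0.8]])
O = np.array([[0.5, 0.1],        # O[x, h] = Pr[observation x | state h]
              [0.3, 0.2],
              [0.2, 0.7]])

def sample_pair():
    """Sample the first two observations of one episode."""
    h1 = rng.choice(2, p=pi)
    x1 = rng.choice(3, p=O[:, h1])
    h2 = rng.choice(2, p=T[:, h1])
    x2 = rng.choice(3, p=O[:, h2])
    return x1, x2

# Empirical second-order moment: P2_hat[i, j] ~ Pr[x2 = i, x1 = j].
n_episodes = 100_000
P2_hat = np.zeros((3, 3))
for _ in range(n_episodes):
    x1, x2 = sample_pair()
    P2_hat[x2, x1] += 1.0
P2_hat /= n_episodes

# Analytic moment implied by the model parameters.
P2 = O @ T @ np.diag(pi) @ O.T
print("max abs deviation of empirical from analytic moment:",
      np.abs(P2_hat - P2).max())
```

Spectral and tensor methods recover (up to permutation) the latent parameters from such second- and third-order moments rather than from likelihood maximization, which is what makes finite-sample guarantees tractable.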
