Sample-Efficient Reinforcement Learning of Undercomplete POMDPs

Partial observability is a common challenge in many reinforcement learning applications; it requires an agent to maintain memory, infer latent states, and integrate this past information into exploration. This challenge gives rise to a number of computational and statistical hardness results for learning general Partially Observable Markov Decision Processes (POMDPs). This work shows that these hardness barriers do not preclude efficient reinforcement learning for rich and interesting subclasses of POMDPs. In particular, we present a sample-efficient algorithm, OOM-UCB, for episodic finite undercomplete POMDPs, in which the number of observations is larger than the number of latent states and exploration is essential for learning, thereby distinguishing our results from prior work. OOM-UCB achieves an optimal sample complexity of $O(1/\epsilon^2)$ for finding an $\epsilon$-optimal policy, while remaining polynomial in all other relevant quantities. As an interesting special case, we also provide a computationally and statistically efficient algorithm for POMDPs with deterministic state transitions.
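
The "OOM" in OOM-UCB refers to observable operator models, in which the probability of an observation sequence is obtained by multiplying one operator per observation against the initial state distribution. The sketch below is a minimal illustration of this observable-operator view for a toy undercomplete model; it is not the paper's OOM-UCB algorithm, and all dimensions and matrix values are invented for the example.

    import numpy as np

    # Toy undercomplete setting: more observations than latent states.
    n_states, n_obs = 2, 3

    T = np.array([[0.9, 0.2],          # T[s', s] = P(s' | s); columns sum to 1
                  [0.1, 0.8]])
    O = np.array([[0.7, 0.1],          # O[o, s] = P(o | s); full column rank
                  [0.2, 0.3],          # (undercomplete: n_obs > n_states)
                  [0.1, 0.6]])
    pi = np.array([0.5, 0.5])          # initial latent-state distribution

    # Observable operator for observation o: A_o = T @ diag(O[o, :]).
    A = [T @ np.diag(O[o]) for o in range(n_obs)]

    def seq_prob(obs):
        """P(o_1, ..., o_t) = 1^T A_{o_t} ... A_{o_1} pi."""
        b = pi.copy()
        for o in obs:
            b = A[o] @ b
        return float(b.sum())

    print(seq_prob([0, 2, 1]))         # probability of observing the sequence 0, 2, 1

In the undercomplete regime the emission matrix has full column rank, which is what allows operator- and moment-based estimators to recover such dynamics from sampled trajectories; the confidence-bound (UCB) component that drives exploration in OOM-UCB is not sketched here.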
