Reinforcement Learning of POMDPs using Spectral Methods

Authors: Kamyar Azizzadenesheli, Alessandro Lazaric, Animashree Anandkumar

Abstract: We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDPs) based on spectral decomposition methods. While spectral methods have previously been employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging because the learner interacts with the environment and may change the distribution of future observations in the process. We devise a learning algorithm that runs through episodes: in each episode, we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the episode, an optimization oracle returns the optimal memoryless planning policy, which maximizes the expected reward under the estimated POMDP model. We prove an order-optimal regret bound with respect to the optimal memoryless policy and efficient scaling with respect to the dimensionality of the observation and action spaces.
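To make the episodic learn-then-plan structure concrete, the following is a minimal Python sketch under strong simplifying assumptions. The toy POMDP, the horizon, and all function names (run_episode, estimate_model, plan_memoryless) are illustrative, and the estimate_model placeholder computes plain empirical observation-level frequencies rather than the paper's spectral (tensor-decomposition) estimator of the true POMDP parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy POMDP (hypothetical instance, for illustration only) --------------
n_states, n_obs, n_actions = 4, 3, 2
T = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # T[a, x, x']
O = rng.dirichlet(np.ones(n_obs), size=n_states)                  # O[x, y]
R = rng.uniform(size=(n_states, n_actions))                       # mean rewards

def run_episode(policy, horizon=2000):
    """Roll out a fixed memoryless policy (obs -> action distribution)."""
    x = rng.integers(n_states)
    traj = []
    for _ in range(horizon):
        y = rng.choice(n_obs, p=O[x])
        a = rng.choice(n_actions, p=policy[y])
        r = R[x, a] + 0.1 * rng.standard_normal()
        traj.append((y, a, r))
        x = rng.choice(n_states, p=T[a, x])
    return traj

def estimate_model(traj):
    """Placeholder for the spectral (method-of-moments) estimation step.

    The paper recovers the POMDP parameters from low-order moments of the
    trajectory; here we only build empirical statistics P(y' | y, a) and
    E[r | y, a] over observations to keep the sketch short.
    """
    counts = np.ones((n_obs, n_actions, n_obs))   # Laplace-smoothed counts
    rew_sum = np.zeros((n_obs, n_actions))
    rew_cnt = np.ones((n_obs, n_actions))
    for (y, a, r), (y_next, _, _) in zip(traj[:-1], traj[1:]):
        counts[y, a, y_next] += 1
        rew_sum[y, a] += r
        rew_cnt[y, a] += 1
    P = counts / counts.sum(axis=2, keepdims=True)
    r_hat = rew_sum / rew_cnt
    return P, r_hat

def plan_memoryless(P, r_hat, gamma=0.95, iters=200):
    """Optimization oracle: value iteration on the estimated model,
    returning a deterministic memoryless policy (observation -> action)."""
    V = np.zeros(n_obs)
    for _ in range(iters):
        Q = r_hat + gamma * P @ V                 # Q[y, a]
        V = Q.max(axis=1)
    greedy = Q.argmax(axis=1)
    policy = np.zeros((n_obs, n_actions))
    policy[np.arange(n_obs), greedy] = 1.0
    return policy

# --- Episodic learn-then-plan loop ------------------------------------------
policy = np.full((n_obs, n_actions), 1.0 / n_actions)  # start exploratory
for episode in range(5):
    traj = run_episode(policy)
    P_hat, r_hat = estimate_model(traj)
    policy = plan_memoryless(P_hat, r_hat)
    avg_reward = np.mean([r for (_, _, r) in traj])
    print(f"episode {episode}: average reward {avg_reward:.3f}")
```

This sketch only mirrors the outer loop of the algorithm: collect a trajectory under a fixed policy, estimate a model, and replan a memoryless policy for the next episode. In the paper's method the estimation step recovers the POMDP parameters themselves via spectral decomposition, which is what enables the stated regret guarantee.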
