Planning in POMDPs Using Multiplicity Automata

Planning and learning in Partially Observable MDPs (POMDPs) are among the most challenging tasks in both the AI and Operations Research communities. Although these problems are intractable in general, special cases, such as structured POMDPs, may admit efficient solutions. A natural and potentially compact way to represent a POMDP is through the predictive state representation (PSR), a representation that has recently received increasing attention. In this work, we relate POMDPs to multiplicity automata, showing that every POMDP can be represented by a multiplicity automaton with no increase in representation size. Furthermore, we show that the size of this multiplicity automaton equals the rank of the predictive state representation. We thereby connect both PSRs and POMDPs to the well-studied multiplicity automata literature. Building on the multiplicity automaton representation, we provide a planning algorithm that is exponential only in the multiplicity automaton rank rather than in the number of states of the POMDP. Consequently, whenever the predictive state representation is logarithmic in the size of the standard POMDP representation, our planning algorithm is efficient.
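To make the connection concrete, here is a minimal sketch of the classical multiplicity automaton model and of how a POMDP's history probabilities fit it; the symbols ($\alpha$, $\beta$, $\mu_\sigma$) are our illustrative notation and need not match the paper's conventions.

```latex
% A minimal sketch of the classical multiplicity automaton (MA) model;
% the notation below is illustrative, not necessarily the paper's.
% An MA of size $k$ over an alphabet $\Sigma$ consists of an initial
% vector $\alpha \in \mathbb{R}^k$, a final vector $\beta \in \mathbb{R}^k$,
% and a transition matrix $\mu_\sigma \in \mathbb{R}^{k \times k}$ for
% each symbol $\sigma \in \Sigma$. On a word $w = \sigma_1 \cdots \sigma_n$
% it computes
\[
  f(w) \;=\; \alpha^{\top}\, \mu_{\sigma_1}\, \mu_{\sigma_2} \cdots \mu_{\sigma_n}\, \beta .
\]
% For a POMDP with $n$ states, one may take $\Sigma$ to be the set of
% action-observation pairs $(a,o)$ and let $f(w)$ be the probability of
% observing history $w$; the usual belief-update matrices then serve as
% the $\mu_{(a,o)}$, yielding an MA of size at most $n$, consistent with
% the "no increase in representation size" claim above.
```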
