Bayesian Time Series Models: Expectation maximisation methods for solving (PO)MDPs and optimal control problems

As this book demonstrates, the development of efficient probabilistic inference techniques has made considerable progress in recent years, in particular with respect to exploiting the structure (e.g., factored, hierarchical or relational) of discrete and continuous problem domains. In this chapter we show that these techniques can also be used to solve Markov Decision Processes (MDPs) or partially observable MDPs (POMDPs) when formulated in terms of a structured dynamic Bayesian network (DBN). The problems of planning in stochastic environments and of inference in state space models are closely related, in particular in view of the challenges both face: scaling to large state spaces spanned by multiple state variables, and realizing planning (or inference) in continuous or mixed continuous-discrete state spaces. Both fields have developed techniques to address these problems. In the field of planning, for instance, they include work on Factored Markov Decision Processes (Boutilier et al., 1995; Koller and Parr, 1999; Guestrin et al., 2003; Kveton and Hauskrecht, 2005), abstractions (Hauskrecht et al., 1998), and relational models of the environment (Zettlemoyer et al., 2005). On the other hand, recent advances in inference techniques show how structure can be exploited both for exact inference and for efficient approximations. Examples include message-passing algorithms (loopy Belief Propagation, Expectation Propagation), variational approaches, approximate belief representations (particles, Assumed Density Filtering, Boyen-Koller) and arithmetic compilation (see, e.g., Minka, 2001; Murphy, 2002; Chavira et al., 2006).

In view of these similarities, one may ask whether existing techniques for probabilistic inference can be translated directly into methods for solving stochastic planning problems. From a complexity-theoretic point of view, the equivalence between inference and planning is well known (see, e.g., Littman et al., 2001). Inference methods have previously been applied to optimal decision making in Influence Diagrams (Cooper, 1988; Pearl, 1988; Shachter, 1988). In contrast to MDPs, however, these methods address a finite number of decisions and a non-stationary policy, where optimal decisions are found by recursing backward from the last decision (see Boutilier et al. (1999) and Toussaint (2009) for a discussion of MDPs versus Influence Diagrams). More recently, Bui et al. (2002) have used inference on Abstract Hidden Markov Models for policy recognition, i.e., for reasoning about executed behaviors, but they do not address the problem of computing optimal policies from such inference. Attias (2003) formulated planning itself as a probabilistic inference problem, conditioning a Markovian state-action model on reaching a goal state and computing the maximum a posteriori action sequence.
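To preview the expectation maximisation methods developed in this chapter, the following is a minimal sketch in the spirit of Toussaint and Storkey (2006): rewards rescaled to [0, 1] are read as probabilities of a binary "reward observed" event, maximising the likelihood of that event is (up to a constant factor) equivalent to maximising the expected discounted return, and EM alternates between evaluating the current stochastic policy (E-step) and reweighting it by the resulting Q-values (M-step). The function name em_policy_search and the toy two-state MDP below are our illustrative assumptions, not material from the chapter.

import numpy as np

def em_policy_search(P, R, gamma=0.9, n_iters=50):
    """EM-style policy optimisation for a tabular MDP (illustrative sketch).

    P: transition tensor of shape (A, S, S), P[a, s, s'] = p(s' | s, a)
    R: reward matrix of shape (S, A), rescaled to [0, 1] so that
       R[s, a] can be read as p(reward = 1 | s, a)
    Returns a stochastic policy pi of shape (S, A).
    """
    A, S, _ = P.shape
    pi = np.full((S, A), 1.0 / A)               # start from the uniform policy

    for _ in range(n_iters):
        # E-step: evaluate the current policy. Solve the linear system
        # V = R_pi + gamma * P_pi V, then form Q(s, a).
        P_pi = np.einsum('sa,ast->st', pi, P)   # state transitions under pi
        R_pi = np.einsum('sa,sa->s', pi, R)     # expected reward under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        Q = R + gamma * np.einsum('ast,t->sa', P, V)

        # M-step: reward-weighted update, pi(a|s) proportional to pi(a|s) Q(s, a).
        pi = pi * Q + 1e-12                     # epsilon guards all-zero rows
        pi /= pi.sum(axis=1, keepdims=True)

    return pi

# Toy two-state, two-action MDP: action 0 stays, action 1 switches states;
# only state 1 yields reward. The optimal policy switches in state 0 and
# stays in state 1, and the EM iterations concentrate pi on exactly that.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],         # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])        # action 1: switch
R = np.array([[0.0, 0.0],                       # state 0: no reward
              [1.0, 1.0]])                      # state 1: reward either way
print(em_policy_search(P, R))

Under the EM derivation each iteration cannot decrease the expected discounted return (for non-negative rewards), mirroring the usual monotonic-improvement guarantee of EM; replacing the multiplicative M-step with a greedy argmax over Q would recover standard policy iteration.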

[1] Nando de Freitas, et al. An Expectation Maximization Algorithm for Continuous Markov Decision Processes with Arbitrary Reward, 2009, AISTATS.

[2] Marc Toussaint, et al. Probabilistic inference for solving discrete and continuous state Markov Decision Processes, 2006, ICML.

[3] Botond Cseke, et al. Advances in Neural Information Processing Systems 20 (NIPS 2007), 2008.

[4] Carl E. Rasmussen, et al. Factorial Hidden Markov Models, 1997.

[5] Toniann Pitassi, et al. Stochastic Boolean Satisfiability, 2001, Journal of Automated Reasoning.

[6] Rajesh P. N. Rao, et al. Goal-Based Imitation as Probabilistic Inference over Graphical Models, 2005, NIPS.

[7] Andrew G. Barto, et al. Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density, 2001, ICML.

[8] Craig Boutilier, et al. Decision-Theoretic Planning: Structural Assumptions and Computational Leverage, 1999, J. Artif. Intell. Res.

[9] Leslie Pack Kaelbling, et al. Planning and Acting in Partially Observable Stochastic Domains, 1998, Artif. Intell.

[10] Nando de Freitas, et al. New inference strategies for solving Markov Decision Processes using reversible jump MCMC, 2009, UAI.

[11] Craig Boutilier, et al. Exploiting Structure in Policy Construction, 1995, IJCAI.

[12] Shobha Venkataraman, et al. Efficient Solution Algorithms for Factored MDPs, 2003, J. Artif. Intell. Res.

[13] Tom Minka. A family of algorithms for approximate Bayesian inference, 2001.

[14] Christopher G. Atkeson, et al. A comparison of direct and model-based reinforcement learning, 1997, Proceedings of the International Conference on Robotics and Automation.

[15] Ross D. Shachter. Probabilistic Inference and Influence Diagrams, 1988, Oper. Res.

[16] Manfred Jaeger, et al. Compiling relational Bayesian networks for exact inference, 2006, Int. J. Approx. Reason.

[17] T. Raiko, et al. Learning nonlinear state-space models for control, 2005, IEEE International Joint Conference on Neural Networks (IJCNN).

[18] Leslie Pack Kaelbling, et al. Representing hierarchical POMDPs as DBNs for multi-scale robot localization, 2004, ICRA.

[19] Marc Toussaint, et al. Probabilistic inference for solving (PO)MDPs, 2006.

[20] Stefan Schaal, et al. Reinforcement learning by reward-weighted regression for operational space control, 2007, ICML.

[21] Marc Toussaint, et al. Hierarchical POMDP Controller Optimization by Likelihood Maximization, 2008, UAI.

[22] Jan Peters, et al. Policy Search for Motor Primitives in Robotics, 2008, NIPS.

[23] Milos Hauskrecht, et al. An MCMC Approach to Solving Hybrid Factored MDPs, 2005, IJCAI.

[24] Nando de Freitas, et al. Bayesian Policy Learning with Trans-Dimensional MCMC, 2007, NIPS.

[25] Thomas G. Dietterich, et al., editors. Advances in Neural Information Processing Systems, 2002.

[26] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1991, Morgan Kaufmann Series in Representation and Reasoning.

[27] Andrew W. Moore, et al. Reinforcement Learning: A Survey, 1996, J. Artif. Intell. Res.

[28] Svetha Venkatesh, et al. Policy Recognition in the Abstract Hidden Markov Model, 2002, J. Artif. Intell. Res.

[29] Milos Hauskrecht, et al. Hierarchical Solution of Markov Decision Processes using Macro-actions, 1998, UAI.

[30] Hagai Attias. Planning by Probabilistic Inference, 2003, AISTATS.

[31] Stuart J. Russell, et al. Dynamic Bayesian networks: representation, inference and learning, 2002.

[32] Daphne Koller, et al. Computing Factored Value Functions for Policies in Structured MDPs, 1999, IJCAI.

[33] Andrew Y. Ng, et al. Policy Search via Density Estimation, 1999, NIPS.

[34] Kee-Eung Kim, et al. Learning Finite-State Controllers for Partially Observable Environments, 1999, UAI.

[35] Leslie Pack Kaelbling, et al. Learning Planning Rules in Noisy Stochastic Worlds, 2005, AAAI.

[36] Marc Toussaint, et al. Model-free reinforcement learning as mixture learning, 2009, ICML.

[37] Craig Boutilier, et al. Bounded Finite State Controllers, 2003, NIPS.