Approximate Inference and Stochastic Optimal Control

We propose a novel reformulation of the stochastic optimal control problem as an approximate inference problem, demonstrating that such an interpretation leads to new practical methods for the original problem. In particular, we characterise a novel class of iterative solutions to the stochastic optimal control problem based on a natural relaxation of the exact dual formulation. These theoretical insights are applied to the Reinforcement Learning problem, where they lead to new model-free, off-policy methods for discrete and continuous problems.
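To make the control-as-inference idea concrete, the following is a minimal, illustrative sketch (not the paper's algorithm): when optimal control is cast as inference, the Bellman backup becomes a "soft" log-sum-exp over actions, and the resulting policy is a Boltzmann distribution over the soft action values. The toy MDP, its transition matrix `P`, and rewards `R` below are invented for illustration.

```python
import numpy as np

# Toy 3-state, 2-action MDP (illustrative numbers only).
# P[a, s, s'] = probability of moving s -> s' under action a.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.1, 0.0]],  # action 1
])
R = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 0.2]])  # R[s, a]
gamma = 0.9

# Soft value iteration: the log-sum-exp backup that arises when
# control is treated as an inference problem (reward plus policy entropy).
V = np.zeros(3)
for _ in range(500):
    # Q[s, a] = R[s, a] + gamma * E_{s'}[V(s')]
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V = np.log(np.exp(Q).sum(axis=1))  # soft maximum over actions

# The induced policy is Boltzmann in the soft Q-values; rows sum to 1.
pi = np.exp(Q - V[:, None])
```

The log-sum-exp backup replaces the hard max of standard dynamic programming; as the rewards are rescaled upwards it recovers the usual greedy solution, which is one way of viewing the "relaxation of the exact dual formulation" mentioned above.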
