Free energy based policy gradients

Despite the plethora of reinforcement learning algorithms in machine learning and control, the majority of work in this area relies on discrete-time formulations of stochastic dynamics. In this work we present a new policy gradient algorithm for reinforcement learning in continuous state-action spaces and continuous time, for free-energy-like cost functions. The derivation rests on successive applications of Girsanov's theorem and on the Radon-Nikodym derivative as formulated for Markov diffusion processes, and the resulting policy gradient is reward-weighted. Working with the Radon-Nikodym derivative also extends the analysis and results to more general models of stochasticity in which jump-diffusion processes are considered. We apply the resulting algorithm in two simple examples of learning attractor landscapes in rhythmic and discrete movements.
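The reward-weighted structure of such a gradient can be illustrated with a toy construction (this is a hedged sketch of the generic idea, not the paper's continuous-time algorithm): actions are drawn from a Gaussian policy, rewards are exponentiated into free-energy-style weights, and the likelihood-ratio (score function) term is averaged under those weights. The reward function, temperature `lam`, and all constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_weighted_gradient(theta, sigma=0.2, lam=1.0, n_rollouts=2000):
    """Monte Carlo estimate of a reward-weighted policy gradient
    for the mean of a Gaussian policy a ~ N(theta, sigma^2)."""
    # Sample actions from the current policy.
    a = theta + sigma * rng.standard_normal(n_rollouts)
    # Toy reward peaked at a = 1, exponentiated into
    # free-energy-style (normalized) weights.
    r = -(a - 1.0) ** 2
    w = np.exp(r / lam)
    w /= w.sum()
    # Score function d/dtheta log N(a | theta, sigma^2).
    score = (a - theta) / sigma ** 2
    return float(np.sum(w * score))

# Gradient ascent on the policy parameter; theta drifts toward
# the reward peak at 1.0.
theta = 0.0
for _ in range(300):
    theta += 0.05 * reward_weighted_gradient(theta)
```

The fixed point of this update is the point where the reward-weighted mean of the sampled actions coincides with the policy mean, which is the hallmark of reward-weighted gradient schemes.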
