Learning model-free robot control by a Monte Carlo EM algorithm

We address the problem of learning robot control by model-free reinforcement learning (RL). We adopt the probabilistic model for model-free RL of Vlassis and Toussaint (Proceedings of the International Conference on Machine Learning, Montreal, Canada, 2009), and we propose a Monte Carlo EM algorithm (MCEM) for control learning that searches directly in the space of controller parameters using information obtained from randomly generated robot trajectories. MCEM is related to, and generalizes, the PoWER algorithm of Kober and Peters (Advances in Neural Information Processing Systems, 2009). In the finite-horizon case MCEM reduces precisely to PoWER, but MCEM can also handle the discounted infinite-horizon case. An interesting result is that the infinite-horizon case can be viewed as a ‘randomized’ version of the finite-horizon case, in the sense that the length of each sampled trajectory is a random draw from an appropriately constructed geometric distribution. We provide preliminary experiments demonstrating the effects of fixed (PoWER) versus randomized (MCEM) horizon length on two simulated robot control tasks and one task on a real robot.
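To make the randomized-horizon construction concrete, below is a minimal Python sketch. It is a hypothetical toy setup, not the paper's implementation: the 1-D regulation task, the parameter-space Gaussian perturbation, and the exponentiated-reward weighting are all assumptions made for brevity. The only element taken from the abstract is that each episode's length T is drawn from a geometric distribution with success probability 1 − γ, so that P(T = t) ∝ γ^t, in place of a fixed horizon.

```python
import numpy as np

rng = np.random.default_rng(0)

gamma = 0.95        # discount factor of the infinite-horizon problem
sigma = 0.5         # std. dev. of parameter-space exploration noise
n_iters = 100
n_rollouts = 20

def rollout(theta, horizon):
    """Toy (hypothetical) 1-D regulation task: run `horizon` steps under the
    linear policy u = theta * s; reward is negative squared distance to 0."""
    s, total = 1.0, 0.0
    for _ in range(horizon):
        s = s + 0.1 * (theta * s) + 0.01 * rng.standard_normal()
        total += -s * s
    return total

theta = 0.0
for _ in range(n_iters):
    params, rewards = [], []
    for _ in range(n_rollouts):
        # Randomized horizon: draw the episode length T from a geometric
        # distribution with success probability 1 - gamma, so that
        # P(T = t) is proportional to gamma**t.
        T = int(rng.geometric(1.0 - gamma))
        p = theta + sigma * rng.standard_normal()   # perturbed parameters
        params.append(p)
        rewards.append(rollout(p, T))
    # EM-style reward-weighted update (a simplification of the PoWER-style
    # update): exponentiated, shifted rewards act as importance weights.
    w = np.exp(np.array(rewards) - np.max(rewards))
    theta = float(np.dot(w, np.array(params)) / np.sum(w))

print("learned feedback gain:", theta)
```

Replacing the geometric draw with a constant T recovers the fixed-horizon (PoWER-like) setting, which is exactly the comparison the experiments make.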

[1] Marc Toussaint, et al. Probabilistic inference for solving discrete and continuous state Markov Decision Processes, 2006, ICML.

[2] G. C. Wei, et al. A Monte Carlo Implementation of the EM Algorithm and the Poor Man's Data Augmentation Algorithms, 1990, Journal of the American Statistical Association.

[3] Radford M. Neal. Monte Carlo Implementation, 1996, in Bayesian Learning for Neural Networks, Springer.

[4] Andrew P. Sage, et al. Uncertainty in Artificial Intelligence, 1987, IEEE Transactions on Systems, Man, and Cybernetics.

[5] M. Goodman. Learning to Walk: The Origins of the UK's Joint Intelligence Committee, 2008.

[6] Michael I. Jordan. Learning in Graphical Models, 1999, NATO ASI Series.

[7] H. Sebastian Seung, et al. Learning to Walk in 20 Minutes, 2005.

[8] Yoon Keun Kwak, et al. Dynamic Analysis of a Nonholonomic Two-Wheeled Inverted Pendulum Robot, 2005, J. Intell. Robotic Syst.

[9] Nando de Freitas, et al. A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot, 2009, Auton. Robots.

[10] Marc Toussaint, et al. Model-free reinforcement learning as mixture learning, 2009, ICML '09.

[11] Geoffrey E. Hinton, et al. Using Expectation-Maximization for Reinforcement Learning, 1997, Neural Computation.

[12] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977, Journal of the Royal Statistical Society, Series B.

[13] Nando de Freitas, et al. Bayesian Policy Learning with Trans-Dimensional MCMC, 2007, NIPS.

[14] Stefan Schaal, et al. Natural Actor-Critic, 2008, Neurocomputing.

[15] Pat Langley, et al. Editorial: On Machine Learning, 1986, Machine Learning.

[16] Michael I. Jordan, et al. PEGASUS: A policy search method for large MDPs and POMDPs, 2000, UAI.

[17] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, MIT Press.

[18] Gregory F. Cooper, et al. A Method for Using Belief Networks as Influence Diagrams, 1988, UAI.

[19] Geoffrey E. Hinton, et al. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, 1998, Learning in Graphical Models.

[20] Jan Peters, et al. Policy Search for Motor Primitives, 2009, Künstliche Intell.

[21] Martin A. Riedmiller, et al. Reinforcement learning for robot soccer, 2009, Auton. Robots.

[22] Jan Peters, et al. Policy Search for Motor Primitives in Robotics, 2011, Machine Learning.

[23] Jürgen Schmidhuber, et al. State-Dependent Exploration for Policy Gradient Methods, 2008, ECML/PKDD.

[24] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Athena Scientific.

[25] Stefan Schaal, et al. Reinforcement learning of motor skills with policy gradients, 2008, Neural Networks.

[26] Marc Toussaint, et al. Probabilistic inference for solving (PO)MDPs, 2006.

[27] Pieter Abbeel, et al. An Application of Reinforcement Learning to Aerobatic Helicopter Flight, 2006, NIPS.

[28] Jan Peters, et al. Using reward-weighted imitation for robot reinforcement learning, 2009, IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.