Multi-optima exploration with adaptive Gaussian mixture model

In learning-by-exploration problems such as reinforcement learning (RL), direct policy search, stochastic optimization, or evolutionary computation, the goal of an agent is to maximize some form of reward function (or minimize a cost function). Often, these algorithms are designed to find a single policy solution. We address the problem of representing the space of control policy solutions by treating exploration as a density estimation problem. Such a representation provides additional information, such as the shape and curvature of local peaks, that can be exploited to analyze the discovered solutions and guide the exploration. We show that the search process can easily be generalized to multi-peaked distributions by employing a Gaussian mixture model (GMM) with an adaptive number of components. The GMM plays a dual role: it represents the space of possible control policies and guides the exploration of new policies. A variation of expectation-maximization (EM), applied to reward-weighted policy parameters, is presented to model the space of possible solutions as if this space were a probability distribution. The approach is tested in a dart-game experiment formulated as a black-box optimization problem, in which the agent's throwing capability increases while it searches for the best strategy to play the game. This experiment is used to study how the proposed approach can exploit new promising solution alternatives in the search process when the optimality criterion slowly drifts over time. The results show that the proposed multi-optima search approach can anticipate such changes by exploiting promising candidates to smoothly adapt to the change of global optimum.
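As a rough illustration of the reward-weighted EM idea described above (not the paper's exact update rules), the sketch below fits a GMM to a batch of sampled policy parameters, with each sample weighted by an exponentiated reward. The function name, the soft-max temperature, and the covariance regularization constant are illustrative assumptions, and the adaptive splitting and merging of components is omitted.

```python
import numpy as np

def reward_weighted_em_update(theta, reward, means, covs, priors, n_iter=10):
    """Reward-weighted EM updates on a GMM over policy parameters (sketch).

    theta  : (N, D) sampled policy parameters
    reward : (N,) rewards for each sample
    means  : (K, D) component means
    covs   : (K, D, D) component covariances
    priors : (K,) mixing coefficients
    """
    N, D = theta.shape
    K = len(priors)
    # Map rewards to positive importance weights via a soft-max transformation
    # (the temperature 10.0 is an assumed, illustrative value).
    w = np.exp(10.0 * (reward - reward.max()))
    w /= w.sum()

    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = np.zeros((N, K))
        for k in range(K):
            diff = theta - means[k]
            inv = np.linalg.inv(covs[k])
            norm = np.sqrt(((2 * np.pi) ** D) * np.linalg.det(covs[k]))
            resp[:, k] = priors[k] * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: weighted parameter updates, combining responsibilities
        # with the reward-derived sample weights.
        wr = resp * w[:, None]
        Nk = wr.sum(axis=0)
        priors = Nk / Nk.sum()
        for k in range(K):
            means[k] = (wr[:, k][:, None] * theta).sum(axis=0) / Nk[k]
            diff = theta - means[k]
            covs[k] = (wr[:, k][:, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return means, covs, priors


if __name__ == "__main__":
    # Toy usage: 2-D policy parameters, reward with two local optima,
    # illustrating the GMM's dual role of sampling and modeling solutions.
    rng = np.random.default_rng(0)
    K, D, N = 2, 2, 200
    means = rng.normal(size=(K, D))
    covs = np.stack([np.eye(D)] * K)
    priors = np.ones(K) / K
    comp = rng.choice(K, size=N, p=priors)
    theta = np.array([rng.multivariate_normal(means[c], covs[c]) for c in comp])
    reward = -np.minimum(np.sum((theta - 2) ** 2, 1), np.sum((theta + 2) ** 2, 1))
    means, covs, priors = reward_weighted_em_update(theta, reward, means, covs, priors)
    print("updated means:\n", means)
```

In this toy loop, new policy parameters are drawn from the current mixture and the reward-weighted update pulls the components toward the high-reward regions; in the multi-optima setting described above, each retained component would track one candidate peak as the optimality criterion drifts.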
