Adaptive Baseline Enhances EM-Based Policy Search: Validation in a View-Based Positioning Task of a Smartphone Balancer

EM-based policy search methods estimate a lower bound of the expected return from the history of episodes and iteratively update the policy parameters by maximizing that lower bound, which makes gradient calculation and learning-rate tuning unnecessary. Previous algorithms, such as Policy learning by Weighting Exploration with the Returns (PoWER), Fitness Expectation Maximization (FEM), and EM-based Policy Hyperparameter Exploration (EPHE), discarded useless low-return episodes either implicitly or by a fixed baseline set by the experimenter. In this paper, we propose an adaptive baseline method that discards low-return samples from the reward history, and we examine different baselines, including the mean and the mean plus or minus multiples of the standard deviation. Simulations of the benchmark tasks of pendulum swing-up, cart-pole balancing, and standing up and balancing of a two-wheeled smartphone robot showed improved performance. We further implemented the adaptive mean baseline on the two-wheeled smartphone robot hardware and tested it on the standing-up and balancing task and a view-based approaching task. With the adaptive baseline, the method outperformed the previous algorithms, achieving faster and more precise behaviors at a higher success rate.
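As a minimal sketch of the idea, assuming a diagonal-Gaussian search distribution over policy parameters updated by reward-weighted averaging in the style of EPHE (the function name, the weighting rule, and the fallback when every episode falls below the baseline are illustrative choices, not the paper's exact algorithm), one update with an adaptive baseline might look like:

```python
import numpy as np

def adaptive_baseline_update(thetas, returns, k=0.0):
    """One EM-style update of a diagonal-Gaussian search distribution.

    thetas  : (N, D) array, policy parameters sampled in the last N episodes
    returns : (N,)   array, the episodic returns of those samples
    k       : baseline = mean + k * SD of the returns; k = 0 gives the plain
              mean baseline, nonzero k gives baselines at multiples of the
              standard deviation from the mean
    """
    baseline = returns.mean() + k * returns.std()
    keep = returns > baseline              # discard low-return episodes
    if not keep.any():                     # degenerate case: keep the best
        keep = returns == returns.max()
    w = returns[keep] - baseline           # weight by return above baseline
    if w.sum() <= 0.0:                     # all surviving weights are zero
        w = np.ones(keep.sum())
    w = w / w.sum()
    mu = w @ thetas[keep]                  # reward-weighted mean
    sigma = np.sqrt(w @ (thetas[keep] - mu) ** 2)  # reward-weighted SD
    return mu, sigma                       # parameters for the next batch
```

With k = 0 this corresponds to the mean baseline tested on the robot hardware; sampling the next batch as `np.random.normal(mu, sigma, size=(N, D))` would close the loop.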
