Student-t policy in reinforcement learning to acquire global optimum of robot control

This paper proposes an actor-critic algorithm whose policy is parameterized by a student-t distribution, named the student-t policy, to enhance learning performance, mainly in terms of the ability to reach the global optimum of the task to be learned. The actor-critic algorithm is one of the policy-gradient methods in reinforcement learning and is proven to converge to a locally optimal policy. To avoid local optima, two properties are deemed empirically effective: an exploration ability that can escape them and conservative learning that avoids being trapped in them. The conventional policy, parameterized by a normal distribution, fundamentally lacks both abilities, and state-of-the-art methods compensate for them only partially. In contrast, heavy-tailed distributions, including the student-t distribution, possess an excellent exploration ability known as Lévy flight, a model of efficient foraging behavior in animals. Another property of the heavy tail is robustness to outliers: learning remains conservative and does not get trapped in local optima even when extreme actions are sampled. These desirable properties of the student-t policy increase the chance that the agent reaches the global optimum. Indeed, the student-t policy outperforms the conventional policy in four types of simulations: two are difficult to learn quickly without sufficient exploration, and the other two contain local optima.
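To illustrate the core idea, the following sketch parameterizes a policy head with a heavy-tailed student-t distribution instead of the usual normal distribution. This is a minimal example, not the paper's exact architecture: the class name StudentTPolicy, the network layout, the hidden size, and the constraint keeping the degrees of freedom above 2 are assumptions made for illustration only.

```python
# Minimal sketch (assumed, not the authors' implementation) of a student-t policy
# head for continuous control with PyTorch. The network outputs the location and
# scale of the action distribution; the degrees of freedom are a learnable
# parameter shared across states in this simplified example.
import torch
import torch.nn as nn
from torch.distributions import StudentT, Independent

class StudentTPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.loc = nn.Linear(hidden, act_dim)            # location (mean-like) parameter
        self.log_scale = nn.Linear(hidden, act_dim)      # log of the scale parameter
        self.log_df = nn.Parameter(torch.zeros(act_dim)) # log degrees of freedom

    def distribution(self, obs):
        h = self.body(obs)
        loc = self.loc(h)
        scale = self.log_scale(h).exp()
        df = self.log_df.exp() + 2.0  # keep df > 2 so the variance stays finite (assumed constraint)
        # treat the action dimensions as independent student-t variables
        return Independent(StudentT(df, loc, scale), 1)

    def act(self, obs):
        dist = self.distribution(obs)
        action = dist.rsample()               # heavy-tailed sampling allows occasional large, Lévy-flight-like jumps
        return action, dist.log_prob(action)  # log-probability used in the policy-gradient update
```

In a vanilla actor-critic update, the returned log-probability would be weighted by the critic's advantage estimate; as the degrees of freedom grow large, the student-t distribution approaches a normal distribution, so the sketch recovers the conventional policy as a special case.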
