An Actor-Critic Algorithm Using Cross Evaluation of Value Functions

To overcome the difficulty of learning a globally optimal policy caused by maximization bias in continuous spaces, an actor-critic algorithm with cross evaluation of double value functions is proposed. Two independently learned value functions bring the critic's estimate closer to the true value function, and the actor is guided by a cross-evaluation function when choosing its actions. Cross evaluation of the value functions avoids the policy jitter exhibited by greedy optimization methods in continuous spaces. The algorithm is more robust than the CACLA learning algorithm, and experimental results show that the learned policy is smoother and noticeably more stable while the computational cost remains almost unchanged.
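The abstract describes the method only at a high level, so the following is a minimal sketch of how a CACLA-style actor-critic with two independently updated value functions and cross evaluation might look. The class name DoubleCriticCACLA, the linear function approximators, and the specific cross-evaluation rule (each critic bootstraps from the other's next-state estimate, and the actor uses the averaged TD error) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

class DoubleCriticCACLA:
    """CACLA-style actor-critic with two independent value functions (sketch)."""

    def __init__(self, state_dim, action_dim,
                 alpha_v=0.05, alpha_pi=0.01, gamma=0.99, sigma=0.1):
        self.w_a = np.zeros(state_dim)                   # critic A, linear V(s)
        self.w_b = np.zeros(state_dim)                   # critic B, linear V(s)
        self.theta = np.zeros((action_dim, state_dim))   # linear deterministic actor
        self.alpha_v, self.alpha_pi = alpha_v, alpha_pi
        self.gamma, self.sigma = gamma, sigma

    def act(self, s):
        # Gaussian exploration around the actor's current output.
        return self.theta @ s + self.sigma * np.random.randn(self.theta.shape[0])

    def update(self, s, a, r, s_next, done):
        v_next_a = 0.0 if done else float(self.w_a @ s_next)
        v_next_b = 0.0 if done else float(self.w_b @ s_next)
        # Cross evaluation (assumed form): each critic bootstraps from the
        # other critic's next-state estimate, so neither chases its own
        # optimistic bias.
        delta_a = r + self.gamma * v_next_b - self.w_a @ s
        delta_b = r + self.gamma * v_next_a - self.w_b @ s
        self.w_a += self.alpha_v * delta_a * s
        self.w_b += self.alpha_v * delta_b * s
        # CACLA-style actor update: move toward the explored action only when
        # the combined TD error indicates it outperformed the current policy.
        if 0.5 * (delta_a + delta_b) > 0:
            self.theta += self.alpha_pi * np.outer(a - self.theta @ s, s)
```

In this sketch the positive-TD-error actor update is the standard CACLA rule; the cross evaluation only changes which value estimates feed the TD errors that drive both critics and the actor.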
