An Adaptive Learning Rate Q-Learning Algorithm Based on Kalman Filter Inspired by Pigeon Pecking-Color Learning

The speed and accuracy of the Q-learning algorithm are critically affected by the learning rate. In most Q-learning applications, the learning rate is set to a constant or decayed according to a predetermined schedule, so it cannot meet the needs of dynamic and rapid learning. In this study, the learning process of pigeons in a pecking-color task was analyzed. We observed an epiphany-like phenomenon during the pigeons' learning: the learning rate did not change gradually, but was large in the early stage and vanished in the middle and late stages. Inspired by these phenomena, an adaptive learning rate Q-learning algorithm based on a Kalman filter model (ALR-KF Q-learning) is proposed in this paper. Q-learning is represented in the framework of a Kalman filter, and the learning rate is equivalent to the Kalman gain, which dynamically weighs the fluctuation of the environmental reward against the agent's cognitive uncertainty about the values of state-action \((s, a)\) pairs. The cognitive uncertainty in the model is determined by the variance of the measurement residual and the variance of the environmental reward, and is set to zero when it is smaller than the variance of the environmental reward. Results on a two-armed bandit task show that the proposed algorithm not only adaptively learns the statistical characteristics of the environmental reward, but also quickly and accurately approximates the expected values of the \((s, a)\) pairs.
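To make the described update concrete, below is a minimal Python sketch of the idea on a two-armed bandit. It is not the authors' implementation: the residual window, the epsilon-greedy policy, and the use of the mean squared residual as the residual-variance estimate are assumptions made for illustration; only the Kalman-gain learning rate and the zero-clamping of the cognitive uncertainty follow the abstract.

```python
import numpy as np

# Sketch of an adaptive-learning-rate bandit update in the spirit of ALR-KF
# Q-learning. `window`, the policy, and the residual estimator are assumptions.

rng = np.random.default_rng(0)

n_arms = 2
true_means = np.array([0.3, 0.7])        # expected reward of each arm
reward_std = 0.1                         # environmental reward noise
R = reward_std ** 2                      # reward variance (known here; could be estimated online)

Q = np.zeros(n_arms)                     # value estimate per action
residuals = [[] for _ in range(n_arms)]  # recent measurement residuals per arm
window = 20                              # residual history length (assumed)

for t in range(2000):
    # epsilon-greedy action selection (assumed; the abstract does not specify a policy)
    a = int(rng.integers(n_arms)) if rng.random() < 0.1 else int(np.argmax(Q))
    r = rng.normal(true_means[a], reward_std)

    # measurement residual (innovation): observed reward minus current estimate
    delta = r - Q[a]
    residuals[a].append(delta)
    residuals[a] = residuals[a][-window:]

    # residual variance estimated from the mean squared residual (assumption);
    # cognitive uncertainty P is the part exceeding the reward variance,
    # clamped to zero when it falls below the reward variance
    resid_var = float(np.mean(np.square(residuals[a])))
    P = max(resid_var - R, 0.0)

    # the Kalman gain plays the role of the learning rate:
    # large while the estimate is uncertain, vanishing once residuals
    # are explained by reward noise alone
    K = P / (P + R)

    Q[a] += K * delta

print("estimated values:", Q)            # should approach true_means
```

With these assumptions the gain starts near one, so early rewards move the estimates quickly, and it decays toward zero once the residuals are no larger than the reward noise, reproducing the "large early, vanishing later" learning-rate profile described above.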
