Hill Climbing on Value Estimates for Search-control in Dyna

Dyna is an architecture for model-based reinforcement learning (RL), where simulated experience from a model is used to update policies or value functions. A key component of Dyna is search-control: the mechanism for generating the states and actions from which the agent queries the model. This mechanism remains largely unexplored. In this work, we propose to generate such states by using the trajectory obtained from Hill Climbing (HC) on the current estimate of the value function. This has the effect of propagating value from high-value regions and of preemptively updating value estimates of the regions the agent is likely to visit next. We derive a noisy projected natural gradient algorithm for hill climbing and highlight a connection to Langevin dynamics. We demonstrate empirically on four classical domains that our algorithm, HC-Dyna, can obtain significant improvements in sample efficiency. We study the properties of different sampling distributions for search-control, and find that there appears to be a benefit specifically from using samples generated by climbing on the current value estimates from low-value to high-value regions.
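
The search-control idea can be illustrated with a short sketch. The snippet below is a minimal, simplified Python illustration under our own assumptions: the toy value function `value(s)`, the finite-difference gradient, the box projection, and the plain gradient-ascent-plus-Gaussian-noise update are illustrative stand-ins, not the paper's exact algorithm, which derives a noisy projected natural gradient (Langevin-style) update on the learned value estimate.

```python
import numpy as np

# Sketch only: generate search-control states for Dyna by noisy hill
# climbing on a value estimate V(s). The value function, gradient, and
# update rule here are simplified stand-ins for illustration.

STATE_LOW, STATE_HIGH = -1.0, 1.0  # hypothetical box bounds on the state space


def value(s):
    """Toy value estimate: highest near the upper corner of the state space."""
    return -np.sum((s - STATE_HIGH) ** 2)


def grad_value(s, eps=1e-5):
    """Central finite-difference gradient of the value estimate."""
    g = np.zeros_like(s)
    for i in range(len(s)):
        d = np.zeros_like(s)
        d[i] = eps
        g[i] = (value(s + d) - value(s - d)) / (2 * eps)
    return g


def hill_climb_states(s0, n_steps=50, step_size=0.05, noise_scale=0.01):
    """Follow a noisy ascent trajectory on V(s); every state visited along
    the way is a candidate search-control state for planning updates."""
    s, trajectory = np.array(s0, dtype=float), []
    for _ in range(n_steps):
        # Noisy gradient ascent step (Langevin-style: gradient + Gaussian noise).
        s = s + step_size * grad_value(s) + noise_scale * np.random.randn(*s.shape)
        # Project back into the valid state space.
        s = np.clip(s, STATE_LOW, STATE_HIGH)
        trajectory.append(s.copy())
    return trajectory


if __name__ == "__main__":
    start = np.random.uniform(STATE_LOW, STATE_HIGH, size=2)
    search_control_queue = hill_climb_states(start)
    # In a full Dyna agent, each queued state would be paired with an action,
    # passed to the learned model for a simulated transition, and used for a
    # planning update alongside ordinary experience replay.
    print(f"collected {len(search_control_queue)} search-control states")
```

In a complete agent, these hill-climbing trajectories populate a search-control queue from which states are drawn during planning, complementing states sampled from the replay buffer.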
