Paper information - Bayesian Q-Learning

Bayesian Q-Learning

A central problem in learning in complex environments is balancing exploration of untested actions against exploitation of actions that are known to be good. The benefit of exploration can be estimated using the classical notion of Value of Information: the expected improvement in future decision quality that might arise from the information acquired by exploration. Estimating this quantity requires an assessment of the agent's uncertainty about its current value estimates for states. In this paper, we adopt a Bayesian approach to maintaining this uncertain information. We extend Watkins' Q-learning by maintaining and propagating probability distributions over the Q-values. These distributions are used to compute a myopic approximation to the value of information for each action and hence to select the action that best balances exploration and exploitation. We establish the convergence properties of our algorithm and show experimentally that it can exhibit substantial improvements over other well-known model-free exploration strategies.
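The action-selection idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes, purely for illustration, that the belief over each action's Q-value in the current state is an independent Gaussian given as a `(mean, std)` pair, and it estimates the myopic value of information by Monte Carlo sampling. The names `myopic_vpi` and `select_action` are hypothetical. The gain used here is one common formulation: learning the true value of an action only helps if it changes which action looks best.

```python
import random


def myopic_vpi(means, samples):
    """Monte Carlo estimate of the myopic value of information per action.

    means[a]   -- current expected Q-value of action a
    samples[a] -- draws from the (assumed) belief over Q(s, a)
    Requires at least two actions.
    """
    order = sorted(range(len(means)), key=lambda a: means[a], reverse=True)
    best, second = order[0], order[1]
    vpi = []
    for a, draws in enumerate(samples):
        if a == best:
            # Gain only if the apparent best action turns out worse
            # than the runner-up's expected value.
            gains = [max(means[second] - q, 0.0) for q in draws]
        else:
            # Gain only if another action turns out better than the
            # expected value of the apparent best action.
            gains = [max(q - means[best], 0.0) for q in draws]
        vpi.append(sum(gains) / len(gains))
    return vpi


def select_action(beliefs, n_samples=2000, rng=None):
    """Pick the action maximizing expected Q-value plus estimated VPI.

    beliefs -- list of (mean, std) pairs, one Gaussian belief per action
               (an illustrative assumption, not taken from the abstract).
    """
    rng = rng or random.Random(0)
    means = [mu for mu, _ in beliefs]
    samples = [[rng.gauss(mu, sigma) for _ in range(n_samples)]
               for mu, sigma in beliefs]
    vpi = myopic_vpi(means, samples)
    scores = [m + v for m, v in zip(means, vpi)]
    action = max(range(len(scores)), key=scores.__getitem__)
    return action, vpi


# Action 0 looks slightly better on average, but action 1 is far more uncertain.
action, vpi = select_action([(1.0, 0.01), (0.9, 2.0)])
```

In this example the uncertain action receives a much larger exploration bonus than the well-understood one, so it is selected even though its mean is lower. When both beliefs are nearly certain, the bonuses vanish and the choice reduces to greedy selection.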

Stuart J. Russell | Nir Friedman | Richard Dearden
