Balancing exploration and exploitation in reinforcement learning using a value of information criterion

In this paper, we consider an information-theoretic approach to the exploration-exploitation dilemma in reinforcement learning. We employ the value of information, a criterion that provides the optimal trade-off between expected returns and a policy's degrees of freedom. As the degrees of freedom are reduced, the agent exploits more than it explores; as they increase, the agent explores more than it exploits. We provide an efficient computational procedure for constructing policies using the value of information and demonstrate its performance on a standard reinforcement-learning benchmark problem.
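
To make the policy construction concrete, the sketch below shows one way a value-of-information-style rule can be embedded in tabular Q-learning: actions are drawn from a Boltzmann-like distribution pi(a|s) proportional to p(a) exp(beta * Q(s,a)), where the inverse temperature beta governs the trade-off between expected return and the policy's degrees of freedom. This is a minimal sketch under stated assumptions, not the paper's exact procedure: the toy chain task, the function names, and the moving-average update of the action marginal p(a) are illustrative.

```python
# Minimal sketch (assumed, not the authors' exact algorithm) of value-of-information-
# style action selection inside tabular Q-learning. The policy weights a prior action
# marginal p(a) by exponentiated Q-values; the inverse temperature beta controls the
# return-versus-degrees-of-freedom trade-off (small beta -> near-uniform exploration,
# large beta -> near-greedy exploitation).
import numpy as np

def voi_policy(q_row, p_marginal, beta):
    """Boltzmann-like policy weighted by the action marginal."""
    logits = np.log(p_marginal + 1e-12) + beta * q_row
    logits -= logits.max()                     # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def run_q_learning(n_states=5, n_actions=2, episodes=500,
                   alpha=0.1, gamma=0.95, beta=2.0, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    p_marginal = np.full(n_actions, 1.0 / n_actions)   # prior over actions

    for _ in range(episodes):
        s = 0
        for _ in range(50):
            pi = voi_policy(Q[s], p_marginal, beta)
            a = rng.choice(n_actions, p=pi)

            # Toy chain dynamics (illustrative): action 1 moves right, action 0 resets.
            s_next = min(s + 1, n_states - 1) if a == 1 else 0
            r = 1.0 if s_next == n_states - 1 else 0.0

            # Standard Q-learning update.
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

            # Slowly refresh the action marginal toward the average policy,
            # mirroring the alternating structure of value-of-information
            # optimization (an assumption of this sketch).
            p_marginal = 0.99 * p_marginal + 0.01 * pi

            s = s_next

    return Q, p_marginal

if __name__ == "__main__":
    Q, p = run_q_learning()
    print("Learned Q-values:\n", Q)
    print("Action marginal:", p)
```

Sweeping beta from small to large values shifts the agent from near-uniform exploration toward greedy exploitation, which is the qualitative behavior the criterion is intended to control.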
