Guided Policy Exploration for Markov Decision Processes Using an Uncertainty-Based Value-of-Information Criterion

Reinforcement learning in environments with many action–state pairs is challenging because a large number of episodes is needed to thoroughly search the policy space. Most conventional heuristics address this search problem in a purely stochastic manner, which can leave large portions of the policy space unvisited during the early stages of training. In this paper, we propose an uncertainty-based, information-theoretic approach for performing guided stochastic searches that cover the policy space more effectively. Our approach is based on the value of information, a criterion that provides the optimal tradeoff between expected costs and the granularity of the search process. The value of information yields a stochastic routine for choosing actions during learning that explores the policy space in a coarse-to-fine manner. We augment this criterion with a state-transition uncertainty factor, which guides the search process into previously unexplored regions of the policy space. We evaluate the uncertainty-based value-of-information policies on the games Centipede and Crossy Road. Our results indicate that our approach yields better-performing policies in fewer episodes than purely stochastic exploration strategies. We show that the training rate of our approach can be further improved by using the policy cross-entropy to guide the selection of our criterion's hyperparameter.
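To make the action-selection idea concrete, the sketch below shows one way such a criterion could look in practice: a soft-max-style weighting of action values against a prior action distribution, with a count-based bonus standing in for the state-transition uncertainty factor. This is a minimal illustration under stated assumptions; the function names, the count-based form of the bonus, and the single hyperparameter beta are illustrative choices, not the paper's exact formulation.

import numpy as np

def voi_action_probabilities(q_values, prior, beta, uncertainty_bonus):
    """Illustrative value-of-information-style action distribution.

    q_values          : estimated action values Q(s, a) for one state
    prior             : prior probability of each action, p(a)
    beta              : exploration hyperparameter; small beta keeps the policy
                        close to the prior (coarse search), large beta
                        concentrates it on high-value actions (fine search)
    uncertainty_bonus : per-action bonus that is large for rarely observed
                        state transitions (assumed count-based here)
    """
    # Weight each action by its prior and an exponential of its
    # bonus-augmented value, then normalize; this is the Gibbs/soft-max
    # form that an information-constrained tradeoff typically produces.
    logits = beta * (q_values + uncertainty_bonus)
    logits -= logits.max()                      # numerical stability
    weights = prior * np.exp(logits)
    return weights / weights.sum()

def count_based_bonus(visit_counts, scale=1.0):
    """Hypothetical state-transition uncertainty factor: decays as the
    transitions for each action are observed more often."""
    return scale / np.sqrt(1.0 + visit_counts)

# Example: four actions available in some state.
rng = np.random.default_rng(0)
q = np.array([1.0, 0.2, 0.5, 0.9])              # current value estimates
counts = np.array([25, 0, 3, 12])               # observed transition counts
prior = np.full(4, 0.25)                        # uniform prior over actions

probs = voi_action_probabilities(q, prior, beta=2.0,
                                 uncertainty_bonus=count_based_bonus(counts))
action = rng.choice(4, p=probs)                 # sample an action to execute

Annealing beta from small to large values over the course of training would reproduce the coarse-to-fine behavior described above: early on the policy stays close to the prior and samples broadly, and later it concentrates on high-value actions, while the decaying bonus keeps steering probability mass toward rarely visited transitions.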
