Blackwell Online Learning for Markov Decision Processes

This work provides a novel interpretation of Markov Decision Processes (MDPs) from an online-optimization viewpoint: the policy of the MDP is viewed as the decision variable, while the corresponding value function is treated as payoff feedback from the environment. Based on this interpretation, we construct a Blackwell game induced by the MDP, which bridges regret minimization, Blackwell approachability theory, and learning theory for MDPs. Specifically, using approachability theory, we propose 1) Blackwell value iteration for offline planning and 2) Blackwell Q-learning for online learning in MDPs, and show that both converge to the optimal solution. Our theoretical guarantees are corroborated by numerical experiments.
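
For context, a minimal sketch of classical tabular Q-learning is given below; this is the standard baseline that Blackwell Q-learning modifies, not the paper's approachability-based update, which is not reproduced in this abstract. The environment interface (env.reset, env.step) and all hyperparameters are assumptions for illustration.

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.95, epsilon=0.1):
        # Q-table over state-action pairs; the greedy policy derived from it
        # plays the role of the "decision variable" in the online view.
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy exploration
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)  # assumed interface
                # classical temporal-difference update toward the Bellman target
                target = r + gamma * np.max(Q[s_next]) * (not done)
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q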
