Reinforcement learning and mistake bounded algorithms

Markov decision processes (MDPs) and partially observable MDPs (POMDPs) have become the models of choice in reinforcement learning. This work explores a connection between mistake-bounded learning algorithms and computing a near-best strategy, from a restricted class of strategies, for a given POMDP. We show that if a class of strategies has a mistake-bound algorithm that makes at most d mistakes, then there is an algorithm that computes a near-best strategy from the class in time polynomial in 1/ε, log(1/δ), and H, and exponential in d, where ε is the accuracy parameter, δ is the confidence parameter, H is the horizon, and d is the mistake bound. Our transformation assumes only the ability to execute actions in the POMDP and the ability to reset the POMDP to its initial state.
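To make the access model concrete, the following is a minimal sketch of how the value of one fixed strategy could be estimated using only reset and action execution. It is an illustration, not the transformation described in the abstract. The `ToyPOMDP` environment, the function names, the memoryless observation-to-action strategy, and the assumption that per-step rewards lie in [0, 1] are all hypothetical. The number of rollouts comes from a standard Hoeffding bound for returns in [0, H].

```python
import math
import random


class ToyPOMDP:
    """Hypothetical two-state environment exposing only reset() and step()."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        # Return to the initial hidden state and emit an observation.
        self.state = 0
        return self._observe()

    def _observe(self):
        # Noisy observation: the true state is reported with probability 0.8.
        return self.state if self.rng.random() < 0.8 else 1 - self.state

    def step(self, action):
        # Reward 1 when the action matches the hidden state, else 0.
        reward = 1.0 if action == self.state else 0.0
        # The hidden state flips with probability 0.3.
        if self.rng.random() < 0.3:
            self.state = 1 - self.state
        return self._observe(), reward


def rollouts_needed(epsilon, delta, horizon):
    # Hoeffding: n >= H^2 * ln(2/delta) / (2 * epsilon^2) rollouts suffice
    # for an epsilon-accurate estimate with probability at least 1 - delta,
    # assuming each H-step return lies in [0, H].
    return math.ceil(horizon ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def estimate_value(env, strategy, horizon, epsilon, delta):
    # Average the H-step return over independent rollouts, each started
    # by resetting the environment to its initial state.
    n = rollouts_needed(epsilon, delta, horizon)
    total = 0.0
    for _ in range(n):
        obs = env.reset()
        for _ in range(horizon):
            obs, reward = env.step(strategy(obs))
            total += reward
    return total / n
```

For example, `estimate_value(ToyPOMDP(seed=1), lambda obs: obs, horizon=5, epsilon=1.0, delta=0.1)` evaluates the strategy that plays the last observation. The sample count is polynomial in 1/ε, log(1/δ), and H, which matches the kind of dependence stated above for a single strategy; the exponential dependence on d is not modeled here.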