The Online Loop-free Stochastic Shortest-Path Problem

We consider a stochastic extension of the loop-free shortest-path problem with adversarial rewards. In this episodic Markov decision problem an agent traverses an acyclic graph with random transitions: at each step of an episode the agent chooses an action, receives a reward, and arrives at a random next state, where the reward and the distribution of the next state depend on the current state and the chosen action. We consider the bandit setting, in which only the reward of the just-visited state-action pair is revealed to the agent. For this problem we develop algorithms that perform asymptotically as well as the best stationary policy in hindsight. Assuming that all states are reachable with probability α > 0 under all policies, we give an algorithm and prove that its regret is O(L^2 sqrt(T|A|)/α), where T is the number of episodes, A denotes the (finite) set of actions, and L is the length of the longest path in the graph. Variants of the algorithm are given that improve the dependence on the transition probabilities under specific conditions. The results are also extended to variations of the problem, including the case when the agent competes with time-varying policies.
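The abstract gives no pseudocode, so the following is a minimal illustrative sketch of the kind of method it describes: an Exp3-style exponential-weights learner kept at each state, fed importance-weighted reward estimates in which the observed reward of the visited state-action pair is divided by its visit probability. The layered-graph encoding, all parameter values, and the assumption that the learner knows the transition kernel `P` are assumptions made here for self-containment, not details taken from the paper.

```python
import numpy as np

# Sketch: per-state exponential weights (Exp3-style) for a loop-free
# episodic MDP with bandit feedback. States are grouped into layers
# 0..L and transitions only move to the next layer; the transition
# kernel P is assumed known. All of this structure is illustrative.

rng = np.random.default_rng(0)

L, S, A = 3, 2, 2          # layers, states per layer, actions (arbitrary)
eta = 0.05                 # learning rate (arbitrary)
# P[l, s, a] = distribution over next-layer states
P = rng.dirichlet(np.ones(S), size=(L, S, A))

weights = np.ones((L, S, A))   # one exponential-weights bandit per state

def policy():
    # Normalize weights into a policy pi[l, s, a] at every state.
    return weights / weights.sum(axis=2, keepdims=True)

def visit_probs(pi):
    # q[l, s] = probability of visiting state s in layer l under pi;
    # episodes are assumed to start in state 0 of layer 0.
    q = np.zeros((L + 1, S))
    q[0, 0] = 1.0
    for l in range(L):
        for s in range(S):
            for a in range(A):
                q[l + 1] += q[l, s] * pi[l, s, a] * P[l, s, a]
    return q

for t in range(1000):          # T = 1000 episodes
    pi = policy()
    q = visit_probs(pi)
    # Adversarial rewards would be chosen externally; random here.
    r = rng.random((L, S, A))
    s = 0
    for l in range(L):
        a = rng.choice(A, p=pi[l, s])
        # Importance-weighted estimate: nonzero only at the visited
        # pair, divided by its visit probability q[l,s] * pi[l,s,a].
        r_hat = np.zeros((S, A))
        r_hat[s, a] = r[l, s, a] / max(q[l, s] * pi[l, s, a], 1e-12)
        weights[l] *= np.exp(eta * r_hat)
        weights[l] /= weights[l].max()   # rescale to avoid overflow
        s = rng.choice(S, p=P[l, s, a])
```

The division by the visit probability is where the reachability assumption matters: if every state is visited with probability at least α under every policy, the reward estimates stay bounded by 1/α, which is what drives the 1/α factor in the regret bound. The full algorithm also handles estimated visit probabilities and explicit exploration, which this sketch omits.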
