Better Rates for Any Adversarial Deterministic MDP

We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state-of-the-art forward in two ways: first, it attains a regret of O(T^{2/3}) with respect to the best fixed policy in hindsight, whereas the previous best regret bound was O(T^{3/4}). Second, the algorithm and its analysis are compatible with any feasible ADMDP graph topology, while all previous approaches required additional restrictions on the graph topology.
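
For readers unfamiliar with the benchmark, the regret notion referred to above can be sketched as follows. The notation is an assumption made for illustration and does not come from the abstract: \ell_t is the adversary's loss function at round t (for rewards, the signs flip), f is the deterministic transition function, \Pi is the set of fixed policies, and s_t^{\pi} is the state reached at round t by following \pi from the start.

```latex
% Assumed notation (not from the abstract): losses \ell_t, deterministic
% transitions f, policy class \Pi, learner's trajectory (s_t, a_t).
\[
  R_T \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} \ell_t(s_t, a_t)\right]
  \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t\bigl(s_t^{\pi}, \pi(s_t^{\pi})\bigr),
  \qquad s_{t+1}^{\pi} = f\bigl(s_t^{\pi}, \pi(s_t^{\pi})\bigr).
\]
% Dividing the two bounds by T gives the per-round regret:
\[
  \frac{O(T^{2/3})}{T} = O\bigl(T^{-1/3}\bigr)
  \qquad\text{versus}\qquad
  \frac{O(T^{3/4})}{T} = O\bigl(T^{-1/4}\bigr).
\]
```

Under this reading, the improved bound means the average per-round regret vanishes at rate T^{-1/3} rather than T^{-1/4}.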
