Paper Information - Better Rates for Any Adversarial Deterministic MDP

Better Rates for Any Adversarial Deterministic MDP

We consider regret minimization in adversarial deterministic Markov Decision Processes (ADMDPs) with bandit feedback. We devise a new algorithm that pushes the state-of-the-art forward in two ways: First, it attains a regret of O(T^{2/3}) with respect to the best fixed policy in hindsight, whereas the previous best regret bound was O(T^{3/4}). Second, the algorithm and its analysis are compatible with any feasible ADMDP graph topology, while all previous approaches required additional restrictions on the graph topology.
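
To make the two rates concrete, the following is a minimal sketch of what regret against the best fixed policy in hindsight might look like when written out. The notation is an assumption for illustration and is not taken from the paper: \ell_t stands for the adversary's loss function at round t, (s_t, a_t) for the learner's state-action pair, \Pi for the set of deterministic policies, and s_t^{\pi} for the state reached at round t when \pi is followed throughout.

```latex
% Assumed notation (hypothetical, not from the source):
%   \ell_t     adversarially chosen loss at round t
%   (s_t,a_t)  state-action pair visited by the learner
%   \Pi        set of deterministic policies
%   s_t^{\pi}  state at round t when following \pi from the start
\[
  R_T \;=\; \sum_{t=1}^{T} \ell_t(s_t, a_t)
        \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} \ell_t\bigl(s_t^{\pi}, \pi(s_t^{\pi})\bigr)
\]
% Dividing the two stated bounds by T gives the average regret per round:
\[
  \frac{O(T^{2/3})}{T} = O(T^{-1/3}),
  \qquad
  \frac{O(T^{3/4})}{T} = O(T^{-1/4}).
\]
```

Under this reading, both bounds imply that the average regret vanishes as T grows, and the O(T^{2/3}) bound does so at the faster rate T^{-1/3} rather than T^{-1/4}.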

Elad Hazan | Ofer Dekel
