Online Learning in Markov Decision Processes with Changing Cost Sequences

In this paper we consider online learning in finite Markov decision processes (MDPs) with changing cost sequences under full and bandit information. We propose to view this problem as an instance of online linear optimization, and we present two methods for solving it: MD2 (mirror descent with approximate projections) and the continuous exponential weights algorithm with Dikin walks. We provide a rigorous complexity analysis of these techniques together with near-optimal regret bounds; in particular, the analysis of MD2 accounts for the computational cost of performing approximate projections. In the full-information setting, our results complement existing ones. In the bandit setting we consider the online stochastic shortest path problem, a special case of the above MDP problems, and improve on existing results by removing the restrictive assumption that the state-visitation probabilities are uniformly bounded away from zero under all policies.
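To make the online-linear-optimization view concrete, the following is a minimal sketch of a mirror descent update with the entropy mirror map over the probability simplex; the function name and the toy cost stream are illustrative assumptions, not the paper's MD2. In the paper's setting the simplex is replaced by the polytope of occupancy measures of the MDP, and the Bregman projection back onto that polytope can only be computed approximately, which is precisely the step whose computational cost the MD2 analysis takes into account.

```python
import numpy as np

def entropy_md_step(q, cost, eta):
    """One mirror descent step with the (negative-)entropy mirror map.

    q    : current point on the probability simplex (in the MDP setting
           this would be an occupancy measure; a plain simplex is used
           here for brevity)
    cost : linear cost vector revealed by the environment this round
    eta  : step size
    """
    # Unnormalized multiplicative update (exponentiated gradient).
    w = q * np.exp(-eta * cost)
    # Bregman projection back onto the simplex. In MD2 the projection
    # is onto the occupancy-measure polytope and is only approximate.
    return w / w.sum()

# Toy run: 3 actions, randomly drawn costs standing in for an
# adversarial sequence.
rng = np.random.default_rng(0)
q = np.ones(3) / 3
for t in range(1, 101):
    cost = rng.uniform(size=3)
    q = entropy_md_step(q, cost, eta=np.sqrt(np.log(3) / t))
```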
