Fast rates for online learning in Linearly Solvable Markov Decision Processes

We study the problem of online learning in a class of Markov decision processes known as linearly solvable MDPs. In the stationary version of this problem, a learner interacts with its environment by directly controlling the state transitions, attempting to balance a fixed state-dependent cost and a certain smooth cost penalizing extreme control inputs. In the current paper, we consider an online setting where the state costs may change arbitrarily between consecutive rounds, and the learner only observes the costs at the end of each round. We are interested in constructing algorithms for the learner that guarantee small regret against the best stationary control policy chosen in full knowledge of the cost sequence. Our main result shows that the smoothness of the control cost enables the simple follow-the-leader algorithm to achieve a regret of order $\log^2 T$ after $T$ rounds, vastly improving on the best previously known regret bound of order $T^{3/4}$ for this setting.
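To make the setting concrete, below is a minimal Python sketch of the follow-the-leader strategy in an average-cost linearly solvable MDP, using Todorov's linearized Bellman equation: the desirability vector $z$ solves the eigenproblem $\lambda z = \mathrm{diag}(e^{-c})\,P\,z$, and the optimal controlled transitions reweight the passive dynamics as $u^*(x'\mid x) \propto P(x'\mid x)\,z(x')$. This is a sketch under those standard assumptions, not the paper's implementation; the function names and the power-iteration solver are illustrative.

```python
# Minimal sketch (illustrative, not the paper's code) of follow-the-leader
# for an average-cost linearly solvable MDP. P is the row-stochastic passive
# dynamics; c is a vector of per-state costs.

import numpy as np

def solve_lmdp(P, c, iters=1000, tol=1e-10):
    """Power iteration for the principal eigenvector of diag(exp(-c)) @ P.

    Returns the desirability vector z and the optimal transition matrix u,
    assuming the linearized Bellman operator is irreducible and aperiodic.
    """
    G = np.exp(-c)[:, None] * P            # linearized Bellman operator
    z = np.ones(P.shape[0])
    for _ in range(iters):
        z_new = G @ z
        z_new /= z_new.sum()               # normalize to avoid over/underflow
        if np.max(np.abs(z_new - z)) < tol:
            z = z_new
            break
        z = z_new
    u = P * z[None, :]                     # reweight passive dynamics by z
    u /= u.sum(axis=1, keepdims=True)      # row-normalize into probabilities
    return z, u

def ftl_policy(P, past_costs):
    """Follow the leader: act optimally for the average of observed costs."""
    c_bar = np.mean(past_costs, axis=0)    # empirical average of state costs
    _, u = solve_lmdp(P, c_bar)
    return u
```

At each round $t$, the learner would play `ftl_policy(P, [c_1, ..., c_{t-1}])`; the paper's point is that the smoothness induced by the control cost makes this naive strategy achieve $O(\log^2 T)$ regret.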