Making Non-Stochastic Control (Almost) as Easy as Stochastic

Recent literature has made much progress in understanding \emph{online LQR}: a modern learning-theoretic take on the classical control problem in which a learner attempts to optimally control an unknown linear dynamical system with fully observed state, perturbed by i.i.d. Gaussian noise. It is now understood that the optimal regret over a time horizon $T$ against the optimal control law scales as $\widetilde{\Theta}(\sqrt{T})$. In this paper, we show that the same regret rate (against a suitable benchmark) is attainable even in the considerably more general non-stochastic control model, where the system is driven by \emph{arbitrary adversarial} noise (Agarwal et al. 2019). In other words, \emph{stochasticity confers little benefit in online LQR}. We attain the optimal $\widetilde{\mathcal{O}}(\sqrt{T})$ regret when the dynamics are unknown to the learner, and $\mathrm{poly}(\log T)$ regret when known, provided that the cost functions are strongly convex (as in LQR). Our algorithm is based on a novel variant of online Newton step (Hazan et al. 2007), which adapts to the geometry induced by possibly adversarial disturbances, and our analysis hinges on generic "policy regret" bounds for certain structured losses in the OCO-with-memory framework (Anava et al. 2015). Moreover, our results accommodate the full generality of the non-stochastic control setting: adversarially chosen (possibly non-quadratic) costs, partial state observation, and fully adversarial process and observation noise.
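To make these objects concrete, here is a sketch in our own notation (not verbatim from the paper). The learner suffers costs $c_t(x_t, u_t)$ and competes with the best policy $\pi$ from a benchmark class $\Pi$ (e.g., stabilizing linear controllers) executed on the \emph{same} disturbance sequence:
\[
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} c_t(x_t, u_t) \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t\big(x_t^{\pi}, u_t^{\pi}\big).
\]
And the classical online Newton step of Hazan et al. (2007), which the algorithm here builds on, maintains $A_t = A_{t-1} + \nabla_t \nabla_t^{\top}$ for observed gradients $\nabla_t$ and plays
\[
x_{t+1} \;=\; \Pi_{\mathcal{K}}^{A_t}\!\left( x_t - \tfrac{1}{\gamma}\, A_t^{-1} \nabla_t \right),
\]
where $\gamma > 0$ is tuned to the exp-concavity (here, strong convexity) of the losses and $\Pi_{\mathcal{K}}^{A}$ denotes projection onto the decision set $\mathcal{K}$ in the norm $\|\cdot\|_{A}$. The variant developed in this paper adapts this update to the geometry induced by the (possibly adversarial) disturbances.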

[1] Yishay Mansour et al. Online Markov Decision Processes. Math. Oper. Res., 2009.

[2] Yishay Mansour et al. Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ Regret. ICML, 2019.

[3] Max Simchowitz et al. Learning Linear Dynamical Systems with Semi-Parametric Least Squares. COLT, 2019.

[4] Ambuj Tewari et al. Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret. ICML, 2012.

[5] Benjamin Recht et al. Certainty Equivalence is Efficient for Linear Quadratic Control. NeurIPS, 2019.

[6] Kunal Talwar et al. Online learning over a finite action set with limited switching. COLT, 2018.

[7] Elad Hazan et al. Logarithmic regret algorithms for online convex optimization. Machine Learning, 2007.

[8] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. 1960.

[9] Robert F. Stengel. Optimal Control and Estimation. 1994.

[10] Avinatan Hassidim et al. Online Linear Quadratic Control. ICML, 2018.

[11] Karan Singh et al. Logarithmic Regret for Online Control. NeurIPS, 2019.

[12] Na Li et al. Online Optimal Control with Linear Dynamics and Predictions: Algorithms and Regret Analysis. NeurIPS, 2019.

[13] Karthik Sridharan et al. Online Non-Parametric Regression. COLT, 2014.

[14] Nikolai Matni et al. Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator. NeurIPS, 2018.

[15] Dante C. Youla et al. Modern Wiener-Hopf Design of Optimal Controllers, Part I. 1976.

[16] Sham M. Kakade et al. The Nonstochastic Control Problem. ALT, 2020.

[17] Babak Hassibi et al. Logarithmic Regret Bound in Partially Observable Linear Dynamical Systems. NeurIPS, 2020.

[18] Naman Agarwal et al. Online Control with Adversarial Disturbances. ICML, 2019.

[19] Csaba Szepesvári et al. Regret Bounds for the Adaptive Control of Linear Quadratic Systems. COLT, 2011.

[20] Amin Karbasi et al. Minimax Regret of Switching-Constrained Online Convex Optimization: No Phase Transition. NeurIPS, 2020.

[21] Babak Hassibi et al. Regret Minimization in Partially Observable Linear Quadratic Control. arXiv, 2020.

[22] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. ICML, 2003.

[23] Oren Anava et al. Online Learning for Adversaries with Memory: Price of Past Mistakes. NIPS, 2015.

[24] Peter Auer et al. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput., 2002.

[25] Max Simchowitz et al. Logarithmic Regret for Adversarial Online Control. ICML, 2020.

[26] Varun Kanade et al. Tracking Adversarial Targets. ICML, 2014.

[27] Elad Hazan. Introduction to Online Convex Optimization. Found. Trends Optim., 2016.

[28] Max Simchowitz et al. Naive Exploration is Optimal for Online LQR. ICML, 2020.

[29] Max Simchowitz et al. Improper Learning for Non-Stochastic Control. COLT, 2020.

[30] Y. Halevi. Stable LQG controllers. IEEE Trans. Autom. Control, 1994.

[31] Alon Cohen et al. Logarithmic Regret for Learning Linear Quadratic Regulators Efficiently. ICML, 2020.

[32] Yurii Nesterov et al. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 2013.

[33] Yuval Peres et al. Bandits with switching costs: $T^{2/3}$ regret. STOC, 2013.