Online learning with dynamics: A minimax perspective

We study the problem of online learning with dynamics, where a learner interacts with a stateful environment over multiple rounds. In each round, the learner selects a policy to deploy and incurs a cost that depends on both the chosen policy and the current state of the world. The state-evolution dynamics and the costs may vary over time, possibly adversarially. In this setting, we study the problem of minimizing policy regret and provide non-constructive upper bounds on the minimax rate. Our main results give sufficient conditions for online learnability in this setup, along with the corresponding rates. The rates are characterized by 1) a complexity term capturing the expressiveness of the underlying policy class under the state-change dynamics, and 2) a dynamics-stability term measuring the deviation of the instantaneous loss from a certain counterfactual loss. Further, we provide matching lower bounds which show that both terms are indeed necessary. Our approach provides a unifying analysis that recovers regret bounds for several well-studied problems, including online learning with memory, online control of linear quadratic regulators, online Markov decision processes, and tracking adversarial targets. In addition, we show how our tools help obtain tight regret bounds for new problems (with non-linear dynamics and non-convex losses) for which such bounds were not known prior to our work.
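To make the protocol concrete, here is a minimal sketch of the interaction loop and of policy regret in this stateful setting. All names (`rollout`, `policy_regret`, the toy scalar dynamics and quadratic losses) are illustrative assumptions, not the paper's construction; the key point the code captures is that the comparator's loss is counterfactual, evaluated on the state trajectory the fixed comparator policy would itself induce.

```python
def rollout(policy, dynamics, losses, x0, T):
    """Cumulative loss of playing one fixed policy for T rounds.

    policy:   maps state -> action
    dynamics: maps (state, action, round) -> next state
    losses:   losses[t](state, action) -> cost at round t
    """
    x, total = x0, 0.0
    for t in range(T):
        u = policy(x)
        total += losses[t](x, u)
        x = dynamics(x, u, t)
    return total


def policy_regret(played_policies, policy_class, dynamics, losses, x0):
    """Learner's cumulative loss minus that of the best fixed policy,
    where each comparator is charged along its *own* induced trajectory
    (the counterfactual comparison defining policy regret)."""
    T = len(losses)
    x, learner_loss = x0, 0.0
    for t in range(T):               # learner may switch policies each round
        u = played_policies[t](x)
        learner_loss += losses[t](x, u)
        x = dynamics(x, u, t)
    best = min(rollout(pi, dynamics, losses, x0, T) for pi in policy_class)
    return learner_loss - best


# Toy instance (assumed for illustration): scalar linear dynamics,
# time-invariant quadratic costs, and a small class of linear policies.
T = 20
x0 = 1.0
dynamics = lambda x, u, t: 0.9 * x + u
losses = [lambda x, u: x**2 + 0.1 * u**2 for _ in range(T)]
policy_class = [lambda x, k=k: -k * x for k in (0.0, 0.5, 0.9)]

# A learner that never acts (k = 0) suffers positive policy regret here.
reg = policy_regret([policy_class[0]] * T, policy_class, dynamics, losses, x0)

# Playing the best fixed policy throughout gives zero policy regret.
best_pi = min(policy_class, key=lambda pi: rollout(pi, dynamics, losses, x0, T))
reg_best = policy_regret([best_pi] * T, policy_class, dynamics, losses, x0)
```

Note the design choice: `policy_regret` re-simulates each comparator from the initial state rather than evaluating it along the learner's realized trajectory, which is exactly what distinguishes policy regret from standard (external) regret under dynamics.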
