Explaining fast improvement in online imitation learning

Online imitation learning (IL) is an algorithmic framework that leverages interactions with expert policies for efficient policy optimization. Policies are optimized by performing online learning on a sequence of loss functions that encourage the learner to mimic expert actions, and if the online learner achieves no regret, the agent provably learns an expert-like policy. Online IL has demonstrated empirical success in many applications and, interestingly, the policy improvement speed observed in practice is often much faster than existing theory suggests. In this work, we provide an explanation of this phenomenon. Let $\xi$ denote the policy class bias and assume the online IL loss functions are convex, smooth, and non-negative. We prove that, after $N$ rounds of online IL with stochastic feedback, the policy improves at a rate of $\tilde{O}(1/N + \sqrt{\xi/N})$, both in expectation and with high probability. In other words, adopting a sufficiently expressive policy class in online IL has two benefits: the policy improvement speed increases and the performance bias decreases. A minimal sketch of this loop is given below.
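The sketch below illustrates the setting described above: the learner is rolled out, the expert is queried along the visited states, and a no-regret online learner (here, online gradient descent) is applied to a convex, smooth, non-negative per-round imitation loss. The environment interface (`env.reset`, `env.step`), the expert query `expert_action`, and the feature map `featurize` are hypothetical placeholders, not part of the paper; this is a sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

def online_imitation_learning(env, expert_action, featurize, n_rounds=100,
                              horizon=50, step_size=0.1, dim=8, act_dim=2):
    """DAgger-style online IL sketch with online gradient descent (no-regret)."""
    W = np.zeros((act_dim, dim))                  # linear policy parameters
    for n in range(n_rounds):
        # Roll out the current learner policy and query the expert online.
        features, targets = [], []
        s = env.reset()
        for _ in range(horizon):
            x = featurize(s)
            a = W @ x                             # learner's action
            features.append(x)
            targets.append(expert_action(s))      # stochastic expert feedback
            s, done = env.step(a)
            if done:
                break
        X = np.stack(features)                    # shape (T, dim)
        Y = np.stack(targets)                     # shape (T, act_dim)
        # Per-round loss: mean squared imitation error (convex, smooth, >= 0).
        grad = 2.0 / len(X) * (W @ X.T - Y.T) @ X
        # Online gradient descent update with a decaying step size.
        W -= step_size / np.sqrt(n + 1) * grad
    return W
```

With a sufficiently expressive feature map (i.e., small policy class bias $\xi$), the bound stated above suggests the average per-round imitation loss of such a loop decays at roughly $\tilde{O}(1/N)$ rather than the generic $O(1/\sqrt{N})$ online-learning rate.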
