Dynamic regret convergence analysis and an adaptive regularization algorithm for on-policy robot imitation learning

On-policy imitation learning algorithms such as DAgger evolve a robot control policy by executing it, measuring performance (loss), obtaining corrective feedback from a supervisor, and generating the next policy. Because the loss between iterations can vary unpredictably, a fundamental question is under what conditions this process will eventually achieve a converged policy. If one assumes the underlying trajectory distribution is static (stationary), it is possible to prove convergence for DAgger. However, in more realistic models for robotics, the underlying trajectory distribution is dynamic because it is a function of the policy. Recent results show it is possible to prove convergence of DAgger when a regularity condition on the rate of change of the trajectory distributions is satisfied. In this article, we reframe this result using dynamic regret theory from the field of online optimization and show that dynamic regret can be applied to any on-policy algorithm to analyze its convergence and optimality. These results inspire a new algorithm, Adaptive On-Policy Regularization (AOR), that ensures the conditions for convergence are satisfied. We present simulation results with cart-pole balancing and locomotion benchmarks that suggest AOR can significantly decrease dynamic regret and chattering as the robot learns. To our knowledge, this is the first application of dynamic regret theory to imitation learning.
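As a rough illustration of the on-policy loop described above, the sketch below couples a DAgger-style data-aggregation loop with a regularization weight that is adapted between iterations. This is a minimal sketch under stated assumptions: the `rollout` and `supervisor` callables, the linear ridge-regression policy, and the heuristic rule for growing the regularization weight are illustrative placeholders, not the AOR procedure from the article.

```python
import numpy as np

def aor_dagger_sketch(rollout, supervisor, n_iters=50, lam=1e-2, growth=2.0):
    """Illustrative DAgger-style loop with adaptive ridge regularization.

    rollout(policy) -> array of visited states, shape (T, d)
    supervisor(states) -> corrective actions, shape (T, m)
    The lambda-update heuristic (grow lam when successive policies stop
    contracting) is a stand-in for the article's AOR rule.
    """
    X, Y = [], []                       # aggregated on-policy dataset
    W = None                            # linear policy: action = state @ W
    prev_step = np.inf
    for k in range(n_iters):
        # First iteration: roll out the supervisor; afterwards, the learner.
        policy = (lambda s: s @ W) if W is not None else supervisor
        states = rollout(policy)        # execute the current policy
        X.append(states)
        Y.append(supervisor(states))    # supervisor labels the visited states
        A, B = np.vstack(X), np.vstack(Y)
        # Ridge-regularized least-squares fit of the next policy.
        W_new = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ B)
        step = np.linalg.norm(W_new - W) if W is not None else np.inf
        # Heuristic adaptive-regularization step: if the distance between
        # successive policies grows, increase lam to slow the updates.
        if step > prev_step:
            lam *= growth
        prev_step, W = step, W_new
    return W, lam
```

The design intuition, in the spirit of the article, is that heavier regularization damps how quickly the policy (and hence the induced trajectory distribution) can change between iterations, which is the kind of regularity condition under which convergence can be argued.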
