Adversarially Regularized Policy Learning Guided by Trajectory Optimization

Recent advances in combining trajectory optimization with function approximation (especially neural networks) show promise for learning complex control policies for diverse tasks on robotic systems. Despite their great flexibility, the large neural networks used to parameterize control policies pose significant challenges. The learned neural control policies are often overly complex and non-smooth, which is inconsistent with the fact that, for most robotic systems, the optimal control policy is smooth with respect to the state; as a result, they often generalize poorly in practice. To address this issue, we propose adVErsarially Regularized pOlicy learNIng guided by trajeCtory optimizAtion (VERONICA) for learning smooth control policies. Specifically, our approach controls the smoothness (local Lipschitz continuity) of the neural control policy by stabilizing the output control with respect to the worst-case perturbation of the input state. Our experiments on robot manipulation show that the proposed approach not only improves the sample efficiency of neural policy learning but also enhances the robustness of the policy against various disturbances, including sensor noise, environmental uncertainty, and model mismatch.
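To make the idea concrete, below is a minimal PyTorch sketch of an adversarial smoothness regularizer in the spirit described above: an inner loop searches for a small state perturbation that maximally changes the policy output, and the outer loss penalizes that change alongside an imitation loss toward trajectory-optimization targets. The function names, hyperparameters (epsilon, step counts, lam), and the use of projected gradient ascent for the inner maximization are illustrative assumptions, not the authors' exact algorithm.

```python
import torch

def adversarial_smoothness_loss(policy, states, epsilon=0.05, n_steps=3, step_size=0.02):
    """Worst-case smoothness penalty: find a small state perturbation that
    maximally changes the policy output, then penalize that change."""
    # Start from a random perturbation inside the epsilon-ball around the state.
    delta = torch.empty_like(states).uniform_(-epsilon, epsilon).requires_grad_(True)
    with torch.no_grad():
        clean_actions = policy(states)

    # Inner maximization: a few projected gradient-ascent steps on the perturbation.
    for _ in range(n_steps):
        perturbed_actions = policy(states + delta)
        change = ((perturbed_actions - clean_actions) ** 2).sum()
        grad, = torch.autograd.grad(change, delta)
        with torch.no_grad():
            delta += step_size * grad.sign()
            delta.clamp_(-epsilon, epsilon)
    delta = delta.detach()

    # Outer penalty: the policy output should be stable under the worst-case
    # perturbation found, which encourages local Lipschitz continuity.
    return ((policy(states + delta) - policy(states)) ** 2).mean()


def training_loss(policy, states, target_actions, lam=1.0):
    """Imitation loss toward trajectory-optimization target actions plus the
    adversarial smoothness regularizer (hypothetical combined objective)."""
    imitation = ((policy(states) - target_actions) ** 2).mean()
    return imitation + lam * adversarial_smoothness_loss(policy, states)
```

In this sketch, `target_actions` stands in for controls produced by a trajectory optimizer on sampled states; in practice the perturbation radius and the regularization weight would be tuned per task.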
