Policy Representation via Diffusion Probability Model for Reinforcement Learning

Popular reinforcement learning (RL) algorithms tend to produce unimodal policy distributions, which limits the expressiveness of complex policies and weakens exploration. The diffusion probability model excels at learning complex multimodal distributions and has shown promising applications to RL. In this paper, we formally build a theoretical foundation for policy representation via the diffusion probability model and provide practical implementations of diffusion policies for online model-free RL. Concretely, we characterize the diffusion policy as a stochastic process, offering a new approach to representing a policy. We then present a convergence guarantee for the diffusion policy, which provides a theoretical basis for understanding its multimodality. Furthermore, we propose DIPO, an implementation of model-free online RL with a DIffusion POlicy. To the best of our knowledge, DIPO is the first algorithm to solve model-free online RL problems with a diffusion model. Finally, extensive empirical results demonstrate the effectiveness and superiority of DIPO on the standard MuJoCo continuous-control benchmark.
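To make the idea of representing a policy as a denoising diffusion process concrete, the sketch below shows how an action could be sampled by running a learned reverse (denoising) chain conditioned on the state, in the DDPM style. This is a minimal illustration under stated assumptions, not the paper's DIPO implementation; the `NoiseNet` architecture, the linear beta schedule, the number of diffusion steps, and the action clamping are all assumptions made for the example.

```python
# Minimal sketch of a DDPM-style diffusion policy sampler (illustrative only;
# the network, beta schedule, and hyperparameters are assumptions, not DIPO's).
import torch
import torch.nn as nn

class NoiseNet(nn.Module):
    """Predicts the noise added to an action, conditioned on state and diffusion step t."""
    def __init__(self, state_dim, action_dim, hidden=256, T=20):
        super().__init__()
        self.t_embed = nn.Embedding(T, 16)
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 16, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, noisy_action, t):
        return self.net(torch.cat([state, noisy_action, self.t_embed(t)], dim=-1))

@torch.no_grad()
def sample_action(noise_net, state, action_dim, T=20):
    """Run the reverse denoising chain a_T -> a_0, conditioned on the state."""
    betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    a = torch.randn(state.shape[0], action_dim)    # a_T ~ N(0, I)
    for t in reversed(range(T)):
        t_idx = torch.full((state.shape[0],), t, dtype=torch.long)
        eps = noise_net(state, a, t_idx)           # predicted noise at step t
        # DDPM posterior mean of a_{t-1} given a_t
        a = (a - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                                  # inject noise except at the final step
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)
    return a.clamp(-1.0, 1.0)                      # actions assumed to lie in [-1, 1]

# Usage: sample a batch of actions for given states (dimensions chosen arbitrarily).
state_dim, action_dim = 17, 6
policy = NoiseNet(state_dim, action_dim)
actions = sample_action(policy, torch.randn(4, state_dim), action_dim)
```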